FAST COMMUNITY DETECTION BY SCORE

By Jiashun Jin* 
Carnegie Mellon University 

Consider a network where the nodes split into K different communities.
The community labels for the nodes are unknown and it is of major
interest to estimate them (i.e., community detection). The Degree
Corrected Block Model (DCBM) is a popular network model. How to
detect communities with the DCBM is an interesting problem, where
the main challenge lies in the degree heterogeneity.

We propose a new approach to community detection which we
call the Spectral Clustering On Ratios-of-Eigenvectors (SCORE).
Compared to classical spectral methods, the main innovation is to use
the entry-wise ratios between the first leading eigenvector and each of
the other leading eigenvectors for clustering. Let $X$ be the adjacency
matrix of the network. We first obtain the $K$ leading eigenvectors,
say, $\hat\eta_1, \ldots, \hat\eta_K$, and let $\hat R$ be the
$n \times (K-1)$ matrix such that
$\hat R(i,k) = \hat\eta_{k+1}(i)/\hat\eta_1(i)$, $1 \le i \le n$,
$1 \le k \le K-1$. We then use $\hat R$ for clustering by applying the
k-means method.

The central surprise is that the effect of degree heterogeneity is largely
ancillary, and can be effectively removed by taking entry-wise ratios
between $\hat\eta_{k+1}$ and $\hat\eta_1$, $1 \le k \le K-1$.

The method is successfully applied to the web blogs data and the 
karate club data, with error rates of 58/1222 and 1/34, respectively. 
These results are much more satisfactory than those by the classical 
spectral methods. Also, compared to modularity methods, SCORE 
is computationally much faster and has smaller error rates. 

We develop a theoretic framework where we show that under mild
conditions, the SCORE stably yields successful community detection.
At the core of the analysis is the recent development in Random
Matrix Theory (RMT), where the matrix-form Bernstein inequality
is especially helpful.

1. Introduction. Driven by the emergence of online "networking
communities" (e.g., Facebook, LinkedIn, MySpace, Google+) and by the growing
recognition of scientifically central networked phenomena (e.g., gene
regulatory networks, citation networks, road networks), we see today a great
demand for methods to infer the presence of network phenomena,
particularly in the presence of large datasets. Tools and discoveries in this
area could potentially reshape scientific data analysis and even have impacts
on daily life (friendship, marketing, security).

Large complex network data sets pose an array of new problems where 
statisticians can make substantial contributions and scientific discoveries. 
In particular, solidly founded principled approaches are needed to make 
predictions and to detect structures. 

A problem that is of major interest is "network community detection"
[7, 8, 9, 14, 22, 23, 24, 25, 32, 33]. Consider an n-node (undirected) graph
$\mathcal{N} = (V, E)$, where $V = \{1, 2, \ldots, n\}$ is the set of nodes
and $E$ is the set of edges. We believe that $V$ partitions into a small
number of (disjoint) subsets or "communities". The nodes within the same
community share some common characteristics. The community labels are
unknown to us and the main interest is to estimate them.

An iconic example is the web blogs data [1], which was collected right after 
the 2004 presidential election. Each node of the network is a web blog about 
US politics, and each edge indicates a hyperlink between them (we neglect 
the direction of the hyperlink so that the graph is undirected). In this net- 
work, there are two perceivable communities: political liberal and political 
conservative. It is believed that the web blogs share some common political 
characteristics (liberal or conservative, one supposes) that are significantly 
different between two communities, but are not significantly different among 
the nodes in the same community. 

1.1. Degree-corrected block model (DCBM). In the spirit of "all models
are wrong, but some are useful" [5], we wish to find a network model that 
is both realistic and mathematically tractable. 

The stochastic block model (BM) is a classic network model. The BM 
is mathematically simple and relatively easy to analyze [3]. However, it is 
too restrictive to reflect some prominent empirical characteristics of real 
networks. For example, the BM implies that the nodes within each com- 
munity have more or less the same degrees. However, this conflicts with 
the empirical observation that in many natural networks, the degrees follow 
approximately a power-law distribution [12, 19].

In a different line of development, there are the p* model and the
exponential random graph model (ERGM) [12]. Compared to the BM, these
models are more flexible but, unfortunately, also more complicated and
therefore much harder to analyze.

DCBM is a recent model proposed by [18], which has become increasingly
popular in network analysis [7, 8, 18, 30, 33]. Compared to the BM, DCBM
allows for degree heterogeneity and is much more realistic: for each node, it
uses a free parameter to model the degree.

The comparison of DCBM with the p* model and the ERGM [12, 19]
is not obvious, given that all of them use a large number of parameters.
However, in sections below, we propose a new spectral method where we
show that in the DCBM, the degree heterogeneity parameters are largely
ancillary: as far as community detection is concerned, it is almost
unnecessary to estimate these heterogeneity parameters. For this reason, the
DCBM is much easier to analyze than the p* or the ERGM model.

Perhaps the easiest way to describe the DCBM is to start with the case
of two communities (discussion on the case of K communities is in Section
2). Recall that $\mathcal{N} = (V, E)$ denotes an undirected network. We
suppose the nodes split into two (disjoint) communities as follows:

$V = V^{(1)} \cup V^{(2)}$.

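Let $X$ be the $n \times n$ adjacency matrix of $\mathcal{N}$. In the DCBM,
we fix $(n + 3)$ positive parameters $(a, b, c)$ and
$\{\theta^{(n)}(i)\}_{i=1}^n$ and assume that

• $X$ is symmetric, with 0 on the diagonals (so there are no self
connections);

• The coordinates on the upper triangular $\{X(i,j) : 1 \le i < j \le n\}$
are independent Bernoulli random variables satisfying

$P(X(i,j) = 1) = \theta^{(n)}(i)\,\theta^{(n)}(j) \cdot \begin{cases} a, & i, j \in V^{(1)}, \\ c, & i, j \in V^{(2)}, \\ b, & \text{otherwise}. \end{cases}$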

As $n$ ranges, we assume $(a, b, c)$ are fixed but $\theta^{(n)}(i)$ may vary
with $n$. The superscript "n" becomes tedious, so for simplicity, we drop it
from now on. We call $\{\theta(i) : 1 \le i \le n\}$ the degree heterogeneity
parameters or heterogeneity parameters for short.
For identifiability, we assume

$\max\{a, b, c\} = 1, \qquad \theta_{\max} \le g_0,$

where $\theta_{\max} = \max_{1 \le i \le n}\{\theta(i)\}$ and
$g_0 \in (0, 1)$ is a constant.

It is probably more convenient if we rewrite the model in the matrix form.
The following notations are associated with the heterogeneity parameters
$\{\theta(i)\}_{i=1}^n$ and are frequently used in this paper. Let $\theta$
and $\Theta$ be the $n \times 1$ vector and the $n \times n$ diagonal matrix
defined as follows:

(1.1)  $\theta = (\theta(1), \theta(2), \ldots, \theta(n))'$,  $\Theta(i,i) = \theta(i)$,  $1 \le i \le n$.

Moreover, for $k = 1, 2$, let $1_k$ be the $n \times 1$ indicator vector such
that $1_k(i) = 1$ if $i \in V^{(k)}$ and $0$ otherwise. With these notations,
we can rewrite

$X = E[X] + W, \qquad W = X - E[X],$

where $E[X]$ denotes the expectation of $X$ (also an $n \times n$ matrix), and

$E[X] = \Omega - \mathrm{diag}(\Omega), \qquad \Omega = \Theta\,[a 1_1 1_1' + c 1_2 1_2' + b(1_1 1_2' + 1_2 1_1')]\,\Theta.$

Note that the entries in the upper triangular part of $W$ are independently
(but not identically) distributed as centered Bernoulli; such $W$ is known as
a "Wigner"-type matrix [29]. Note also that, while it may seem that the
indicator vectors $1_1$ and $1_2$ are known, they are not, as they depend on
the unknown community partition that is of primary interest to us in this
paper.
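To make the matrix form concrete, here is a minimal sketch in Python (NumPy) that builds $\Omega$ and samples $X$ from the two-community DCBM. The function name `sample_dcbm2`, the label encoding (0 for $V^{(1)}$, 1 for $V^{(2)}$), and all numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sample_dcbm2(theta, labels, a, b, c, rng):
    """Sample X from the two-community DCBM (hypothetical helper).

    theta  : length-n array of heterogeneity parameters
    labels : length-n array, 0 for community 1 and 1 for community 2
    a, b, c: block parameters with max(a, b, c) = 1
    """
    theta = np.asarray(theta, dtype=float)
    labels = np.asarray(labels)
    ones1 = (labels == 0).astype(float)   # indicator vector 1_1
    ones2 = (labels == 1).astype(float)   # indicator vector 1_2
    block = (a * np.outer(ones1, ones1) + c * np.outer(ones2, ones2)
             + b * (np.outer(ones1, ones2) + np.outer(ones2, ones1)))
    # Omega = Theta [ ... ] Theta, with Theta = diag(theta)
    omega = theta[:, None] * block * theta[None, :]
    n = theta.size
    # Independent Bernoulli draws on the strict upper triangle (i < j)
    upper = np.triu(rng.random((n, n)) < omega, k=1)
    X = (upper | upper.T).astype(int)     # symmetric, zero diagonal
    return X, omega

# Illustrative parameter values (assumptions)
rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n // 2)
theta = rng.uniform(0.2, 0.9, size=n)
X, omega = sample_dcbm2(theta, labels, a=1.0, b=0.3, c=0.8, rng=rng)
```

Only the strict upper triangle is sampled, so the diagonal of `X` is zero and `E[X]` equals `omega` minus its diagonal, as in the display above.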

1.2. Where is the information: spectral analysis heuristics. In [28], John
Tukey mentioned an idea that can serve as a general guideline for statistical
inference. Tukey's idea is that before we tackle any statistical problem,
we should think about "which part of the data contains the information" : 
the "best" procedure should capture the most direct information containing 
the quantity of interest. 

In our setting, the quantities of interest are the community labels.
Recall that

$X = \Omega - \mathrm{diag}(\Omega) + W.$

Seemingly, $\Omega$ contains the most direct information of the community
labels: the matrix $W$ only contains noisy and indirect information of the
labels, and the matrix $\mathrm{diag}(\Omega)$ only has a negligible effect,
compared to that of $\Omega$.

In light of this, we take a close look at $\Omega$. For $k = 1, 2$, let
$\theta^{(k)}$ be the $n \times 1$ vector such that

$\theta^{(k)}(i) = \theta(i)$ if $i \in V^{(k)}$, and $\theta^{(k)}(i) = 0$ otherwise, $1 \le i \le n$.

For any vector $x$, let $\|x\|$ denote the $\ell^2$-norm. Write for short

$d_k = \|\theta^{(k)}\| / \|\theta\|, \qquad k = 1, 2.$

Note that $d_k$ can be interpreted as the overall degree intensity of the
k-th community.

Definition 1.1. We call an eigenvalue $\lambda$ of a matrix $A$ simple if the
algebraic multiplicity of $\lambda$ is 1 [15].

In most of the paper, the eigenvalues of interest are simple. The
following lemma is a special case of Lemma 2.1, which is proved in Section
7 (note that $\Theta$ is the diagonal matrix as in (1.1)).

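Lemma 1.1. If $ac \ne b^2$, then the matrix $\Omega$ has two simple nonzero
eigenvalues

$\frac{1}{2}\|\theta\|^2 \left( a d_1^2 + c d_2^2 \pm \sqrt{(a d_1^2 - c d_2^2)^2 + 4 b^2 d_1^2 d_2^2} \right),$

and the associated eigenvectors $\eta_1$ and $\eta_2$ (with possible non-unit
norms) are

$\Theta \left( b d_2^2 \cdot 1_1 + \frac{1}{2}\left[ c d_2^2 - a d_1^2 \pm \sqrt{(a d_1^2 - c d_2^2)^2 + 4 b^2 d_1^2 d_2^2} \right] \cdot 1_2 \right).$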

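The key observation is as follows. Let $r$ be the $n \times 1$ vector of the
coordinate-wise ratios between $\eta_1$ and $\eta_2$ (up to normalizations)

$r(i) = \dfrac{\eta_2(i)/\|\eta_2\|}{\eta_1(i)/\|\eta_1\|}, \qquad 1 \le i \le n.$

Define the $n \times 1$ vector $r_0$ by

(1.2)  $r_0(i) = \begin{cases} 1, & i \in V^{(1)}, \\ -\left( \dfrac{a d_1^2 - c d_2^2 + \sqrt{(a d_1^2 - c d_2^2)^2 + 4 b^2 d_1^2 d_2^2}}{2 b d_1 d_2} \right)^2, & i \in V^{(2)}. \end{cases}$

Then by Lemma 1.1 and basic algebra,

$r \propto r_0.$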
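The closed forms in Lemma 1.1 and (1.2) can be checked numerically. The sketch below (Python with NumPy) builds $\Omega$ for one arbitrary choice of $(a, b, c)$ and $\theta$, then compares the formulas with a direct eigen-decomposition. The parameter values and community sizes are assumptions chosen only for illustration, and the unnormalized eigenvectors are used, so the ratio equals $r_0$ exactly rather than up to a constant.

```python
import numpy as np

rng = np.random.default_rng(1)
n, a, b, c = 60, 1.0, 0.4, 0.7                 # illustrative values, ac != b^2
labels = np.repeat([0, 1], [25, 35])           # 0: V^(1), 1: V^(2)
theta = rng.uniform(0.1, 0.9, size=n)

ones1 = (labels == 0).astype(float)
ones2 = (labels == 1).astype(float)
block = (a * np.outer(ones1, ones1) + c * np.outer(ones2, ones2)
         + b * (np.outer(ones1, ones2) + np.outer(ones2, ones1)))
omega = theta[:, None] * block * theta[None, :]

norm2 = np.sum(theta ** 2)                      # ||theta||^2
d1sq = np.sum(theta[labels == 0] ** 2) / norm2  # d_1^2
d2sq = np.sum(theta[labels == 1] ** 2) / norm2  # d_2^2
root = np.sqrt((a * d1sq - c * d2sq) ** 2 + 4 * b ** 2 * d1sq * d2sq)

# Eigenvalues from Lemma 1.1
lam_plus = 0.5 * norm2 * (a * d1sq + c * d2sq + root)
lam_minus = 0.5 * norm2 * (a * d1sq + c * d2sq - root)

# Eigenvectors from Lemma 1.1 (non-unit norms)
eta1 = theta * (b * d2sq * ones1 + 0.5 * (c * d2sq - a * d1sq + root) * ones2)
eta2 = theta * (b * d2sq * ones1 + 0.5 * (c * d2sq - a * d1sq - root) * ones2)

# Coordinate-wise ratio and the value predicted by (1.2) on V^(2)
r = eta2 / eta1
r0_on_v2 = -((a * d1sq - c * d2sq + root) / (2 * b * np.sqrt(d1sq * d2sq))) ** 2

eigvals = np.linalg.eigvalsh(omega)             # ascending order
print(np.allclose(eigvals[-1], lam_plus), np.allclose(r[labels == 1], r0_on_v2))
```

The ratio is constant within each community and does not involve the individual $\theta(i)$, which is the point made in the text that follows.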

We are now ready to answer Tukey's query on "where is the information":
the sign vector of $r$ is the place that contains the most direct information
of the community labels!

The central surprise is that, as far as community detection is concerned, the
heterogeneity parameters $\{\theta(i)\}_{i=1}^n$ are largely ancillary, and
their influence can be largely removed by taking the coordinate-wise ratio of
$\eta_1$ and $\eta_2$ as above ($r$ still depends on $(\theta, n)$, but only
through the overall degree intensities $d_1$ and $d_2$). This allows us to
successfully extract the information containing the community labels without
any attempt to estimate the heterogeneity parameters.

Compared to many approaches (e.g., the modularity approach to be introduced
below) where we attempt to estimate the heterogeneity parameters,
our approach has advantages. The reason is that many real-world networks
(e.g., web blogs network) are sparse in the sense that the degrees for many
nodes are small. If we try to estimate the heterogeneity parameters of such
nodes, we get relatively large estimation errors which may propagate to
subsequent studies.

1.3. SCORE: a new approach to spectral community detection. The above
observations motivate the following procedure for community detection,
which we call Spectral Clustering On Ratios-of-Eigenvectors (SCORE).

• (a). Let $\hat\eta_1$ and $\hat\eta_2$ be the two unit-norm eigenvectors of
$X$ associated with the largest and the second largest eigenvalues (in
magnitude), respectively.

• (b). Let $\hat r$ be the $n \times 1$ vector of coordinate-wise ratios:
$\hat r(i) = \hat\eta_2(i)/\hat\eta_1(i)$, $1 \le i \le n$.

• (c). Cluster the labels by applying the k-means method to the vector
$\hat r$, assuming there are $\le 2$ communities in total.

The key insight is that, under mild conditions, we expect to see that

$\hat\eta_1 \approx \eta_1/\|\eta_1\|, \qquad \hat\eta_2 \approx \eta_2/\|\eta_2\|,$

where $\eta_1$ and $\eta_2$ are the two eigenvectors of $\Omega$ as in Lemma
1.1. Comparing with (1.2), we expect to have

$\hat r \approx r \propto r_0.$

In Step (c), we use the k-means method. Alternatively, we could use the
hierarchical clustering method [13]. For most of the numeric study in this
paper, we use the k-means package in MATLAB. In comparison, the k-means
method and the hierarchical method perform mostly similarly, with the latter
sometimes slightly worse.

Note that since $\hat r$ is one-dimensional, both methods are equivalent to
simple thresholding. That is, for some threshold $t$, we classify a node $i$,
$1 \le i \le n$, to one community if $\hat r(i) > t$, and to the other
community otherwise. Seemingly, the simplest choice is $t = 0$.
Alternatively, one could use a recursive algorithm to determine the
threshold: (a) estimate the community labels by applying the simple
thresholding to $\hat r$ with $t = 0$, (b) update the threshold with the
estimated labels, say, following (1.2) with $(a, b, c, d_1, d_2)$ estimated,
(c) repeat (a)-(b) with the threshold updated recursively.
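Steps (a)-(c), with simple thresholding in place of k-means, can be sketched as follows in Python (NumPy). The function name `score_two_communities`, the simulated network, and its parameter values are assumptions for illustration; the paper's own experiments use MATLAB. The sketch assumes a connected network, so that the first leading eigenvector has no zero entries.

```python
import numpy as np

def score_two_communities(X, t=0.0):
    """SCORE for two communities: threshold the eigenvector ratios at t.

    Assumes X is the symmetric adjacency matrix of a connected network.
    """
    X = np.asarray(X, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(X)
    order = np.argsort(-np.abs(eigvals))      # rank eigenvalues by magnitude
    eta1 = eigvecs[:, order[0]]               # step (a): first leading eigenvector
    eta2 = eigvecs[:, order[1]]               #           second leading eigenvector
    r_hat = eta2 / eta1                       # step (b): coordinate-wise ratios
    labels = (r_hat > t).astype(int)          # step (c): simple thresholding
    return labels, r_hat

# Hypothetical demo on a simulated two-community DCBM
rng = np.random.default_rng(0)
n = 400
truth = np.repeat([0, 1], n // 2)
theta = rng.uniform(0.3, 0.9, size=n)
A = np.array([[1.0, 0.2], [0.2, 0.9]])        # plays the role of (a, b; b, c)
omega = theta[:, None] * A[np.ix_(truth, truth)] * theta[None, :]
upper = np.triu(rng.random((n, n)) < omega, k=1)
X = (upper | upper.T).astype(int)

labels, r_hat = score_two_communities(X)
mismatch = np.mean(labels != truth)
error_rate = min(mismatch, 1 - mismatch)      # labels are defined up to a swap
print(f"error rate: {error_rate:.3f}")
```

Because each eigenvector is determined only up to a factor of ±1, the estimated labels may come out swapped; the error rate is therefore taken as the smaller of the two possible mismatch rates.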

1.4. Applications to the web blogs data and the karate club data. We 
investigate the performance of the SCORE with two well-known networks: 
the web blogs network and the karate club network. The web blogs network is 
introduced earlier in the paper. The network has a giant component which 
we use for the analysis. The giant component consists of 1222 nodes and 
16714 edges. Each blog is manually labeled either as liberal or conservative 
in [1] which we use as the ground truth. The karate club network can be
found in [31]. The network consists of 34 nodes and 136 edges, where each
node represents a member in the club. Due to the fission of the club, the
network has two perceivable communities: Mr. Hi's group and John's group.
All members are labeled in [31, Table 1] which we use as the ground truth.

Consider the web blogs network first. In the left panel of Figure 1, we
plot the histogram of the vector $\hat r$, which clearly shows a two-mode
pattern, suggesting that there are two underlying communities. In the right
panel of Figure 1, we plot the entries of $\hat r$ versus the indices of the
nodes, with red crosses and blue circles representing the nodes from the
liberal and conservative communities, respectively; the plot shows that the
red crosses and blue circles are almost completely separated from each other,
suggesting that two communities can be nicely separated by applying simple
thresholding to $\hat r$.




Fig 1. The vector $\hat r$ (web blogs data). Left: histogram of $\hat r$.
Right: plot of the entries of $\hat r$ versus the node indices (red cross:
liberal; blue circle: conservative).

The error rate of the SCORE is reasonably satisfactory. In fact, if we use 
the procedure following steps (a)-(c), the error rate is 58/1222. The error rate 
stays the same if we replace the k-means method in (c) by the hierarchical 
method (for both methods, we use the built-in functions in matlab; the 
linkage for the hierarchical method is chosen as "average" [13]). 

Alternatively, we can use simple thresholding in step (c). In fact, the
k-means method is equivalent to simple thresholding with $t = -0.7$. Moreover,
the error rate is 82/1222 if we set $t = 0$, and the error rate is 55/1222 if
we set $t = -0.6$ (this is the "ideal threshold", the threshold we would
choose if we know the true labels; if only!). The results are tabulated in
Table 1, along with error rates by some other methods, to be discussed below.
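Error counts such as 58/1222 compare estimated labels with the ground truth. Since community labels are only defined up to relabeling, one natural way to count errors is to minimize over all relabelings of the estimated communities. The sketch below (plain Python) does this; the helper name `clustering_errors` and the toy inputs are hypothetical, and the paper does not spell out its exact counting convention.

```python
from itertools import permutations

def clustering_errors(estimated, truth, K):
    """Number of misclassified nodes, minimized over relabelings of the
    K estimated communities (labels are integers 0, ..., K-1)."""
    best = len(truth)
    for perm in permutations(range(K)):
        errors = sum(perm[e] != t for e, t in zip(estimated, truth))
        best = min(best, errors)
    return best

# Toy example: the estimate is correct up to a label swap, except for one node
estimated = [0, 0, 1, 1, 1, 0]
truth = [1, 1, 0, 0, 1, 1]
print(clustering_errors(estimated, truth, K=2))  # → 1
```

The search over all K! relabelings is only practical for small K, which matches the setting of this paper.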

We consider the karate network next. Similarly, in Figure 2, we plot the
coordinates of $\hat r$ associated with the karate data versus the node
indices, with red crosses and blue circles representing the nodes from the
group of Mr. Hi and the group of John [31], respectively. Our method has an
error rate of 1/34 if in step (c) we either use the k-means method or the
simple thresholding with $t = 0$ (the error rate is 0/34 if we set $t$ as the
"ideal threshold"). See Table 1 for details.

Fig 2. Plot of the entries of $\hat r$ versus the node indices (results are
based on karate club network; red cross: Mr. Hi's group; blue circle: John's
group).

1.5. Comparison with classical spectral clustering methods. As a spectral 
clustering approach, the success of the SCORE prompts the following ques- 
tion: would classical spectral methods work well too? By classical spectral 
method, we mean the following procedure. 

• (a'). Obtain the two leading (unit-norm) eigenvectors $\hat\eta_1$ and
$\hat\eta_2$ of $X$.

• (b'). Viewing $(\hat\eta_1, \hat\eta_2)$ as a bivariate data set with
sample size of $n$, apply the k-means method assuming there are at most two
communities.

Alternatively, one may use the following variation, which is studied in [25].

• (a''). Obtain an $n \times n$ diagonal matrix $S$ defined by
$S(i,i) = \sum_{j=1}^n X(i,j)$, $1 \le i \le n$.

• (b''). Apply (a')-(b') to $S^{-1/2} X S^{-1/2}$.

We call the two procedures ordinary PCA (oPCA) and normalized PCA
(nPCA), respectively; PCA stands for Principal Component Analysis.
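The two PCA procedures can be sketched as below in Python (NumPy). The helper names (`kmeans`, `leading_eigenvectors`, `opca`, `npca`) are hypothetical, the plain Lloyd's iteration is only a stand-in for a library k-means, and the toy graph (two 5-node cliques joined by one edge) is an assumption chosen to show the mechanics. It has almost no degree heterogeneity, so it says nothing about the relative performance discussed next.

```python
import numpy as np

def kmeans(points, K, rng, n_init=10, n_iter=100):
    """Lloyd's algorithm with random restarts; returns the best labeling."""
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = points[rng.choice(len(points), size=K, replace=False)]
        for _ in range(n_iter):
            dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            new_centers = np.array([
                points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                for k in range(K)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # Cost with respect to the centers used for the final assignment
        cost = dist[np.arange(len(points)), labels].sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

def leading_eigenvectors(M, K):
    """K unit-norm eigenvectors with the largest eigenvalues in magnitude."""
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, np.argsort(-np.abs(eigvals))[:K]]

def opca(X, rng):
    """Steps (a')-(b'): k-means on the two leading eigenvectors of X."""
    return kmeans(leading_eigenvectors(np.asarray(X, dtype=float), 2), 2, rng)

def npca(X, rng):
    """Steps (a'')-(b''): the same, applied to S^{-1/2} X S^{-1/2}."""
    X = np.asarray(X, dtype=float)
    scale = 1.0 / np.sqrt(X.sum(axis=1))      # assumes no isolated nodes
    return opca(scale[:, None] * X * scale[None, :], rng)

# Toy graph: two cliques of 5 nodes joined by a single edge
X = np.zeros((10, 10), dtype=int)
X[:5, :5] = 1
X[5:, 5:] = 1
np.fill_diagonal(X, 0)
X[4, 5] = X[5, 4] = 1

rng = np.random.default_rng(0)
labels_o = opca(X, rng)
labels_n = npca(X, rng)
```

Both procedures cluster the rows of an n x 2 matrix of raw eigenvector entries, so any node-specific scaling of those entries enters the clustering directly.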

It turns out that both PCA approaches work unsatisfactorily. In fact, for 
the web blogs data, the error rates of oPCA and nPCA are 437/1222 and 
600/1222, respectively, and for the karate data, the error rates are 1/34 for 
both methods. See Table 1 for details. 

The main reason why the two PCA methods perform unsatisfactorily is
that different coordinates of the two leading eigenvectors are heavily
affected by the degree inhomogeneity; see Lemma 1.1. In the left panel of
Figure 3, we display the two leading eigenvectors of $X$, based on the web
blogs data. The coordinates of the two vectors are highly skewed to the
right, reflecting serious degree heterogeneity.




Fig 3. Left: plot of the first leading eigenvector of $X$ (x-axis) versus the
second leading eigenvector of $X$ (y-axis). Middle: plot of the first leading
eigenvector of $S^{-1/2} X S^{-1/2}$ (x-axis) versus the second leading
eigenvector of $S^{-1/2} X S^{-1/2}$ (y-axis). Right: zoom in of the middle
panel. Results are based on the web blogs data, with red representing liberal
and blue representing conservative.

Somewhat surprisingly, though nPCA intends to correct degree heterogeneity,
the correction is not particularly successful. In the right two panels
of Figure 3 (the rightmost panel is the zoom-in version of the panel to its
left), we plot the two leading eigenvectors of $S^{-1/2} X S^{-1/2}$. It is
seen that some of the coordinates of $\hat\eta_2$ are unduly large, compared
to the remaining coordinates.

The underlying reason for the unsatisfactory behavior of nPCA is twofold.
First, many nodes in the web blogs data have very small degrees, so
$S$ only contains relatively poor estimates for the heterogeneity parameters.
Second, even when the heterogeneity parameters are given, it is not always 
helpful to correct them as in nPCA, because when we try to correct the 
degree heterogeneity, we tend to increase the noise level at the same time. 
See Section 2.8 for detailed discussion. 

1.6. Comparison with modularity methods. Modularity methods are well-known
approaches to community detection. The methods have many variants,
including the well-known approach by [18]. For this paper, we use the
recent approach by Zhao et al. [33], which can be viewed as a variant of the
modularity method in [18]; the approach is reported to have similar behavior
to that in [18].

Fig 4. Histogram of errors by modularity methods for the karate data (left)
and the web blogs data (right). The results are based on 100 independent
repetitions.

In principle, modularity methods are computationally NP-hard [3], as they
search exhaustively over all possible community partitions and pick the
one that optimizes the so-called modularity functional. To mitigate this
difficulty, many heuristic algorithms are proposed to approximate the
theoretic optimizer. In particular, Zhao et al. [33] propose a heuristic
algorithm of this type which they call the tabu algorithm.

Compared to the SCORE and the classical spectral approaches, the tabu 
algorithm is computationally much more expensive, and is increasingly so 
when the size or complexity of the network increases. The tabu algorithm is 
also relatively unstable: like most modularity methods, the tabu algorithm 
depends on the initial guess of the community partition, and the algorithm 
may not converge to the true partition with a "bad" initial guess. In theory, 
the instability can be alleviated by increasing the number of searches, but 
this is at the expense of substantially longer computational time. 

The performance of the tabu algorithm for the karate network and web
blogs network is illustrated in Figure 4 (left: karate; right: web blogs),
where for each network, the histogram is based on 100 independent repetitions
(the error rates are random, for each depends on the initial guess of the
community partition, which is generated randomly).

The most prominent problem of the tabu algorithm (and modularity
methods in general) is that, in quite a few repetitions (9 out of 100 for
the web blogs data, and 19 out of 100 for the karate data), the algorithm
fails to converge to the true community partition and yields poor results.
For the karate data, the number of clustering errors has a mean of 4.85 and
a standard deviation of 5.7. For the web blogs data, the number of clustering
errors has a mean of 104.5 and a standard deviation of 145.5. If we remove
the "outliers" (the 9 outlying repetitions for the web blogs data and the 19
outlying repetitions of the karate data), then for the karate data, the errors
have a mean of 2.1 and a standard deviation of 0.6, and for the web blogs
data, the mean is 59 and the standard deviation is 2.4. See Table 1.

                      SCORE                      PCA              Modularity
            t = 0   t = -0.7   t = -0.6   ordinary   normalized
web blogs    82       58         55         437        600        104.5 (SD: 145.4)
karate        1        1          0           1          1          4.9 (SD: 5.7)

Table 1
Comparison of number of errors. For SCORE, thresholding at t = -0.7 is the
same as k-means; t = -0.6 is the ideal threshold. The result of modularity
method depends on the starting point and is random, where mean and standard
deviation (SD) are computed based on 100 independent repetitions. The web
blogs data has 1222 nodes and the karate data has 34 nodes.

In summary, it is fair to say that compared to SCORE, the tabu algorithm
underperforms, especially as it is computationally much more expensive.

That the tabu algorithm is more stable for the web blogs data than the
karate data is unexpected (as the karate data has a relatively small size,
we expect that it is relatively easy for the tabu to find the true community
partition). One possible explanation is that the communities in the former
are more strongly structured, so the algorithm converges faster for the web
blogs data than for the karate data.

1.7. Summary. We propose SCORE as a new approach to network community
detection when a DCBM is reasonable. The main innovation is to
use the coordinate-wise ratios of the leading eigenvectors for clustering. In
doing so, we have taken advantage of the fact that the degree heterogeneity
parameters $\theta(i)$ are merely nuisance parameters and we can largely
remove their effects without actually estimating them.

We have used the karate club data and the web blogs data to investigate
the performance of four algorithms: SCORE, oPCA, nPCA, and tabu.
SCORE behaves much more satisfactorily than the two PCA methods. It also
outperforms the tabu algorithm: it is much faster, more stable, and has a
smaller error rate.

The paper is closely related to [25] (see also [9]), but is different in
important ways. The focus of this paper is on DCBM where the number of
communities K is small, while the focus of [25] is on BM where K is large.
Our analysis is also very different from that in [33], for we use a much
broader model for the degree heterogeneity parameters $\theta(i)$.

1.8. Content. The remaining part of the paper is organized as follows. In
Section 2, we consider a K-community network with a fixed integer $K \ge 2$.
By delicate spectral analysis as in Sections 2.1-2.6, we lay out the framework
under which the SCORE yields consistent estimates of the community labels.
In Section 2.7, we address the stability of the SCORE, and in Section 2.8,
we re-investigate the nPCA by comparing the so-called Signal Noise Ratio of
the nPCA and that of the oPCA. In Section 3, we suggest some extensions of
the SCORE. The main results are proved in Section 4, where we outline the main
technical devices required for the proofs. Numeric investigation is continued
in Section 5, where we compare the SCORE, two PCA approaches, and the
tabu algorithm on simulated data. Section 6 discusses connections between
SCORE and the existing literature. Secondary lemmas are proved in Section 7.

1.9. Notations. In this paper, for two vectors $u, v$ with the same size,
$(u, v)$ denotes their inner product. For any fixed $q > 0$ and any vector
$x$, $\|x\|_q$ denotes the $\ell^q$-norm. The subscript is dropped for
simplicity if $q = 2$. For any matrix $M$, $\|M\|$ denotes the spectral norm
and $\|M\|_F$ denotes the Frobenius norm. For two positive sequences
$\{a_n\}_{n=1}^\infty$ and $\{b_n\}_{n=1}^\infty$, we say $a_n \sim b_n$ if
$a_n/b_n \to 1$ as $n \to \infty$, and we say $a_n \asymp b_n$ if there is a
constant $c_0 > 1$ such that $1/c_0 \le a_n/b_n \le c_0$ for sufficiently
large $n$.

In this paper, the notations $\theta$ and $\Theta$ are always linked to each
other, where $\theta$ denotes the $n \times 1$ vector of degree heterogeneity
parameters and $\Theta$ denotes the $n \times n$ diagonal matrix satisfying
$\Theta(i,i) = \theta(i)$, $1 \le i \le n$. Also,
$\theta_{\min} = \min_{1 \le i \le n}\{\theta(i)\}$ and
$\theta_{\max} = \max_{1 \le i \le n}\{\theta(i)\}$. For a vector $\xi$,
when all coordinates are positive, we use $\mathrm{OSC}(\xi)$ to denote the
oscillation $\max_{1 \le i, j \le n}\{\xi(i)/\xi(j)\}$. Throughout the paper,
$C$ denotes a generic positive constant that may vary from occurrence to
occurrence.

2. Main results. In this section, we consider the community detection
problem where the network $\mathcal{N} = (V, E)$ has $K$ communities. We start
by describing the SCORE and the DCBM for K-community networks, followed
by spectral analysis on $\Omega$ and on $X$, as well as the consistency of
SCORE. We then address the stability of SCORE and conclude by re-investigating
the normalized PCA.

Throughout the paper, $K \ge 2$ is a known integer. See Section 6 for
discussion on the case where K is unknown.

2.1. SCORE when there are K communities. Given an (undirected) network
$\mathcal{N} = (V, E)$, we assume the network splits into K different
communities. That is, the set of nodes $V$ partitions into K different
(disjoint) subsets:

$V = V^{(1)} \cup V^{(2)} \cup \cdots \cup V^{(K)}.$

Let $X$ be the adjacency matrix of $\mathcal{N}$, and introduce

(2.3)  $\mathcal{M}_{n,K-1,K} = \{M : M \text{ is an } n \times (K-1) \text{ matrix that has} \le K \text{ distinct rows}\}.$

For convenience, we use the following terminology, so that whenever we
say "leading eigenvectors" or "leading eigenvalues", we are comparing the
magnitudes of the eigenvalues, neglecting the ±1 signs.

Definition 2.1. Fix $1 \le k \le n$ and an $n \times n$ symmetric matrix $X$.
We say $\lambda_k$ is the k-th leading eigenvalue of $X$ if $\lambda_k$ has
the k-th largest magnitude among all eigenvalues of $X$, and we say $\eta_k$
is the k-th leading eigenvector if it is an eigenvector of $X$ associated
with $\lambda_k$.

The SCORE for K-community networks contains the following steps.

• Obtain the K (unit-norm) leading eigenvectors of $X$:
$\hat\eta_1, \hat\eta_2, \ldots, \hat\eta_K$.

• Fixing a threshold $T_n$, define an $n \times (K-1)$ matrix $\hat R^*$ such
that for all $1 \le i \le n$ and $1 \le k \le K-1$,

(2.4)  $\hat R^*(i,k) = \begin{cases} \hat R(i,k), & \text{if } |\hat R(i,k)| \le T_n, \\ T_n, & \text{if } \hat R(i,k) > T_n, \\ -T_n, & \text{if } \hat R(i,k) < -T_n, \end{cases} \qquad \text{where } \hat R(i,k) = \hat\eta_{k+1}(i)/\hat\eta_1(i).$

• Let $M^*$ be the matrix satisfying

$M^* = \mathrm{argmin}_{M \in \mathcal{M}_{n,K-1,K}} \|\hat R^* - M\|_F^2.$

Write $M^* = [m_1, m_2, \ldots, m_n]'$ so that $m_i'$ is the i-th row of
$M^*$. Note that $M^*$ has at most K distinct rows, say,
$m_{i_1}, m_{i_2}, \ldots, m_{i_K}$ for some indices
$1 \le i_1 < \ldots < i_K \le n$. We partition all nodes into K communities
$\hat V^{(1)}, \hat V^{(2)}, \ldots, \hat V^{(K)}$ such that
$\hat V^{(k)} = \{1 \le j \le n : m_j = m_{i_k}\}$.

Note that the last step is the classical k-means method. We make the
following remarks. First, when $M^*$ has $k < K$ distinct rows, we let
$\hat V^{(\ell)} = \emptyset$ for all $k + 1 \le \ell \le K$. Second, the
choice of the threshold $T_n$ is flexible, and for convenience, we take

(2.5)  $T_n = \log(n)$

in this paper. We impose thresholding in (2.4) mainly for technical
convenience in the proof of Theorem 2.2. Numeric study in this paper suggests
that no coordinate of $\hat R$ would be unduly large and so the thresholding
procedure in (2.4) is rarely necessary. Last, provided that the largest K
eigenvalues are all simple, the vectors $\hat\eta_k$ are uniquely determined,
up to a factor of ±1. Correspondingly, all columns of $\hat R$ are uniquely
determined, up to a factor of ±1; these factors do not affect the clustering
results.
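The K-community procedure, including the thresholding (2.4) with $T_n = \log(n)$, can be sketched as follows in Python (NumPy). The function names `kmeans` and `score`, the Lloyd's iteration (a stand-in for a library k-means, which only approximates the exact minimizer $M^*$), and the simulated three-community network are assumptions for illustration.

```python
import numpy as np
from itertools import permutations

def kmeans(points, K, rng, n_init=10, n_iter=100):
    """Lloyd's algorithm with random restarts; approximates the argmin over M."""
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = points[rng.choice(len(points), size=K, replace=False)]
        for _ in range(n_iter):
            dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            new_centers = np.array([
                points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                for k in range(K)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        cost = dist[np.arange(len(points)), labels].sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

def score(X, K, rng):
    """SCORE for K communities; assumes a connected network."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(X)
    eta = eigvecs[:, np.argsort(-np.abs(eigvals))[:K]]  # K leading eigenvectors
    R = eta[:, 1:] / eta[:, [0]]                        # R(i,k) = eta_{k+1}(i)/eta_1(i)
    T_n = np.log(n)                                     # threshold (2.5)
    R_star = np.clip(R, -T_n, T_n)                      # thresholding (2.4)
    return kmeans(R_star, K, rng), R_star

# Hypothetical demo on a simulated three-community DCBM
rng = np.random.default_rng(2)
n, K = 600, 3
truth = np.repeat(np.arange(K), n // K)
theta = rng.uniform(0.4, 0.9, size=n)
A = np.array([[1.0, 0.2, 0.1], [0.2, 0.9, 0.15], [0.1, 0.15, 0.8]])
omega = theta[:, None] * A[np.ix_(truth, truth)] * theta[None, :]
upper = np.triu(rng.random((n, n)) < omega, k=1)
X = (upper | upper.T).astype(int)

labels, R_star = score(X, K, rng)
# Error rate, minimized over relabelings of the estimated communities
error_rate = min(np.mean(np.array(p)[labels] != truth)
                 for p in permutations(range(K)))
print(f"error rate: {error_rate:.3f}")
```

Flipping the sign of any eigenvector flips the sign of a whole column of `R_star` (or of all columns, for the first eigenvector), which moves the rows rigidly and leaves the k-means partition unchanged.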
2.2. DCBM when there are K communities. The DCBM for K-community
networks is a direct generalization of the DCBM for two-community
networks. As before, we assume that the adjacency matrix $X$ satisfies

(2.6)  $X = E[X] + W, \qquad W = X - E[X], \qquad E[X] = \Omega - \mathrm{diag}(\Omega),$

where $\Omega$ is a symmetric matrix, and $W$ is the symmetric "Wigner"-type
matrix where all diagonals are 0 and the upper triangular entries are
independent centered Bernoulli random variables with parameters
$\Omega(i,j)$.
At the core of the DCBM is a $K \times K$ matrix

$A = (A(i,j))_{1 \le i, j \le K}.$

For positive parameters $\{\theta(i)\}_{i=1}^n$ as before, we extend the
$n \times n$ matrix $\Omega$ to a more general form such that

(2.7)  $\Omega(i,j) = \theta(i)\theta(j)A(k,\ell)$, if $i \in V^{(k)}$ and $j \in V^{(\ell)}$.

Similarly, for identifiability, we fix a constant $g_0 \in (0, 1)$ and assume
that

(2.8)  $\max_{1 \le i, j \le K} A(i,j) = 1, \qquad 0 < \theta_{\min} \le \theta_{\max} \le g_0,$

where $\theta_{\min} = \min_{1 \le i \le n}\{\theta(i)\}$ and
$\theta_{\max} = \max_{1 \le i \le n}\{\theta(i)\}$.
Throughout this paper, we assume

(2.9)  $A$ is symmetric, non-singular, non-negative and irreducible.

A matrix is non-negative if all coordinates are non-negative. See [15, Page 
361] for the definition of irreducible. 

In the analysis below, we use $n$ as the driving asymptotic parameter,
and allow the vector $\theta$ (and so also the matrix $\Theta$; see (1.1)) to
depend on $n$. However, we keep $(K, A)$ as fixed. Consequently, there is a
constant $C = C(A) > 0$ such that $\|A\| \le C$, where $\| \cdot \|$ denotes
the spectral norm.

Definition 2.2. We call Model (2.6)-(2.9) the K-community DCBM. 
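A minimal sketch of the K-community DCBM in Python (NumPy) is given below: it checks condition (2.9) on $A$, builds $\Omega$ as in (2.7), and samples $X$ as in (2.6). The helper names (`satisfies_2_9`, `dcbm_omega`, `sample_adjacency`) and the numerical values are assumptions for illustration. Irreducibility is tested with the standard criterion that a non-negative $K \times K$ matrix $A$ is irreducible exactly when $(I + A)^{K-1}$ is entrywise positive.

```python
import numpy as np

def satisfies_2_9(A):
    """Check that A is symmetric, non-singular, non-negative and irreducible."""
    A = np.asarray(A, dtype=float)
    K = A.shape[0]
    symmetric = bool(np.allclose(A, A.T))
    non_negative = bool(np.all(A >= 0))
    non_singular = bool(np.linalg.matrix_rank(A) == K)
    # Non-negative A is irreducible iff (I + A)^(K-1) has all entries positive
    irreducible = bool(np.all(np.linalg.matrix_power(np.eye(K) + A, K - 1) > 0))
    return symmetric and non_negative and non_singular and irreducible

def dcbm_omega(theta, labels, A):
    """Omega(i, j) = theta(i) theta(j) A(k, l) for i in V^(k), j in V^(l)."""
    theta = np.asarray(theta, dtype=float)
    labels = np.asarray(labels)
    A = np.asarray(A, dtype=float)
    return theta[:, None] * A[np.ix_(labels, labels)] * theta[None, :]

def sample_adjacency(omega, rng):
    """Symmetric 0/1 matrix with zero diagonal and P(X(i,j) = 1) = Omega(i,j), i < j."""
    n = omega.shape[0]
    upper = np.triu(rng.random((n, n)) < omega, k=1)
    return (upper | upper.T).astype(int)

# Illustrative parameter values (assumptions)
rng = np.random.default_rng(5)
labels = np.repeat([0, 1, 2], [40, 50, 60])
theta = rng.uniform(0.2, 0.9, size=labels.size)   # theta_max <= g_0 < 1
A = np.array([[1.0, 0.3, 0.1], [0.3, 0.9, 0.2], [0.1, 0.2, 0.8]])

assert satisfies_2_9(A)
omega = dcbm_omega(theta, labels, A)
X = sample_adjacency(omega, rng)
```

The identity matrix, for instance, fails the check: it is non-negative and non-singular but reducible, which corresponds to communities that never connect to each other.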

The DCBM we use is similar to that in [33] (see also [18]), but is different
in important ways. In their asymptotic analysis, Zhao et al. [33] model
$\theta(i)$ as random variables that have the same means and take only finite
values. In our setting, we treat $\theta(i)$ as non-random and only impose
some mild regularity conditions and moderate deviations conditions (see
below). Additionally, Zhao et al. [33] need certain conditions on $A$ which
we don't require. For example, in the special case of $K = 2$, they require
$A$ to be positive definite, but we don't. See [33, Page 7] for details.

2.3. Spectral analysis of $\Omega$. We start by characterizing the leading
eigenvalues and eigenvectors of $\Omega$. Recall that

$\theta = (\theta(1), \ldots, \theta(n))', \qquad V = V^{(1)} \cup V^{(2)} \cup \cdots \cup V^{(K)}.$

Similarly as before, for $1 \le k \le K$, we let $\theta^{(k)}$ be the
$n \times 1$ vectors such that

(2.10)  $\theta^{(k)}(i) = \theta(i)$ or $0$, according to $i \in V^{(k)}$ or not.

Let $D$ be the $K \times K$ diagonal matrix of the overall degree intensities

$D(k,k) = \|\theta^{(k)}\| / \|\theta\|, \qquad 1 \le k \le K;$

note that $D$ depends on $\theta$ and so it also depends on $n$.

The spectral analysis on £1 hinges on the KxK matrix DAD, where D and 
A are as above. The following lemma characterizes the leading eigenvalues 
and leading eigenvectors of Q, and is proved in Section 7. 

Lemma 2.1. Suppose all eigenvalues of $DAD$ are simple. Let $\lambda_1/\|\theta\|^2$,
$\lambda_2/\|\theta\|^2$, $\ldots$, $\lambda_K/\|\theta\|^2$ be such eigenvalues, arranged in the descending
order of the magnitudes, and let $a_1, a_2, \ldots, a_K$ be the associated (unit-norm)
eigenvectors. Then the $K$ nonzero eigenvalues of $\Omega$ are $\lambda_1, \lambda_2, \ldots, \lambda_K$, with
the associated (unit-norm) eigenvectors being

  $\eta_k = \sum_{i=1}^{K} [a_k(i)/\|\theta^{(i)}\|] \cdot \theta^{(i)}, \qquad k = 1, 2, \ldots, K$.

Note that $(a_k, \eta_k)$ are uniquely determined up to a factor of $\pm 1$; such
factors do not affect clustering results.
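The formula for $\eta_k$ can be checked by a short direct computation. The following is only a sketch of that check (the full proof is in Section 7). It assumes the DCBM form $\Omega = \Theta[\sum_{i,j} A(i,j) 1_i 1_j']\Theta = \sum_{i,j} A(i,j)\theta^{(i)}(\theta^{(j)})'$ and uses that the vectors $\theta^{(i)}$ have disjoint supports, so that $(\theta^{(j)})'\theta^{(\ell)} = \|\theta^{(j)}\|^2$ if $j = \ell$ and $0$ otherwise.

```latex
\begin{align*}
\Omega \eta_k
 &= \sum_{i,j=1}^{K} A(i,j)\,\theta^{(i)} (\theta^{(j)})'
    \sum_{\ell=1}^{K} \frac{a_k(\ell)}{\|\theta^{(\ell)}\|}\,\theta^{(\ell)}
  = \sum_{i=1}^{K} \theta^{(i)} \sum_{j=1}^{K} A(i,j)\,\|\theta^{(j)}\|\,a_k(j) \\
 &= \|\theta\|^2 \sum_{i=1}^{K} \frac{\theta^{(i)}}{\|\theta^{(i)}\|}\,(DADa_k)(i)
  = \lambda_k \sum_{i=1}^{K} \frac{a_k(i)}{\|\theta^{(i)}\|}\,\theta^{(i)}
  = \lambda_k \eta_k ,
\end{align*}
% Third equality: \|\theta^{(j)}\| = \|\theta\| D(j,j) and \theta^{(i)} = \|\theta\| D(i,i)\,\theta^{(i)}/\|\theta^{(i)}\|.
% Fourth equality: DAD a_k = (\lambda_k/\|\theta\|^2) a_k.
% Unit norm: \|\eta_k\|^2 = \sum_i a_k(i)^2 = 1, again by disjoint supports.
```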

2.4. Spectral analysis of $X$. In this section, we characterize the leading
eigenvalues and leading eigenvectors of $X$.

Consider the eigenvalues first. The study contains two key components:
one is to characterize the spectral norm of the noise matrix $W$, and the other
is to impose some conditions on the eigen-spacing of the matrix $DAD$ so
that the space spanned by the $K$ leading eigenvectors of $\Omega$ is stable up to
noise corruption.

For the first component, recall that $\theta$ may depend on $n$. We suppose

(2.11)  $\log(n)\,\theta_{\max}\|\theta\|_1 / \|\theta\|^4 \to 0$, as $n \to \infty$.

Combining (2.11) with basic algebra, it follows that

(2.12)  $\log(n)/\|\theta\|^2 \to 0, \qquad \log(n)\|\theta\|_1\|\theta\|_3^3 / \|\theta\|^6 \to 0,$

which are frequently used in the proof section. The following lemma
characterizes the spectral norm of $W - \mathrm{diag}(\Omega)$.

Lemma 2.2. If (2.11) holds, then with probability at least $1 + o(n^{-3})$,

  $\|W - \mathrm{diag}(\Omega)\| \le 4\sqrt{\log(n)\,\theta_{\max}\|\theta\|_1}$.

Lemma 2.2 is proved in Section 7, where the recent result by [27] on the
matrix-form Bernstein inequality is very helpful.

We wish that the $K$ leading eigenvalues of $X$ are properly spaced and
all of them are bounded away from 0. To ensure that, we need some mild
conditions on $DAD$. In detail, for any symmetric $K \times K$ matrix $A$, we denote
the minimum gap between adjacent eigenvalues of $A$ by

(2.13)  $\mathrm{eigsp}(A) = \min_{1 \le i \le K-1} |\lambda_{i+1} - \lambda_i|, \qquad \lambda_1 > \lambda_2 > \ldots > \lambda_K$.

When any of the eigenvalues of $A$ is not simple, $\mathrm{eigsp}(A) = 0$ by convention.
We assume that there is a constant $C > 0$ such that for sufficiently large $n$,

(2.14)  $\mathrm{eigsp}(DAD) \ge C$.
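To make (2.13)-(2.14) concrete, the following minimal sketch computes the eigen-spacing of $DAD$ for a two-community example. The helper names (`sym2x2_eigenvalues`, `eigsp`, `dad`) and the numerical values of `A` and `D` are illustrative assumptions, not taken from the paper; the closed-form eigenvalue formula only covers symmetric $2 \times 2$ matrices.

```python
import math

def sym2x2_eigenvalues(m):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]], in descending order."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return [mean + radius, mean - radius]

def eigsp(eigenvalues):
    """Minimum gap between adjacent eigenvalues, as in (2.13).

    Returns 0 when an eigenvalue is repeated, matching the stated convention.
    """
    lam = sorted(eigenvalues, reverse=True)
    return min(lam[i] - lam[i + 1] for i in range(len(lam) - 1))

def dad(d, a):
    """Form D A D, with the diagonal matrix D given as a list of diagonal entries."""
    k = len(d)
    return [[d[i] * a[i][j] * d[j] for j in range(k)] for i in range(k)]

# Hypothetical example: two communities with equal overall degree intensities,
# so that D = diag(1/sqrt(2), 1/sqrt(2)).
A = [[1.0, 0.5], [0.5, 1.0]]
D = [1 / math.sqrt(2), 1 / math.sqrt(2)]
gap = eigsp(sym2x2_eigenvalues(dad(D, A)))
```

Here $DAD = A/2$, whose eigenvalues are 0.75 and 0.25, so the gap is about 0.5 and (2.14) holds with any $C \le 0.5$.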

Additionally, we assume the degrees in each community have comparable
"overall degree intensities", in that there is a constant $C > 0$ such that

(2.15)  $\max_{1 \le i,j \le K} \{\|\theta^{(i)}\| / \|\theta^{(j)}\|\} \le C$.

As a result, $D$ has a bounded condition number. Together with (2.9), this
implies that all eigenvalues of $DAD$ are bounded away from 0 by a constant
$C > 0$. Combining these with Lemma 2.1, the following lemma is a direct
result of Lemma 2.2 and basic algebra (e.g., [2, Page 473]), the proof of
which is omitted.

Lemma 2.3. Consider a DCBM where (2.11), (2.14), and (2.15) hold.
Let $\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_K$ be the leading eigenvalues of $X$, and let $\lambda_1/\|\theta\|^2$, $\lambda_2/\|\theta\|^2$,
$\ldots$, $\lambda_K/\|\theta\|^2$ be the nonzero eigenvalues of $DAD$, both sorted descendingly
in magnitudes. With probability at least $1 + o(n^{-3})$, the $K$ leading eigenvalues
of $X$ are all simple, and

  $\max_{1 \le k \le K} \{|\hat{\lambda}_k - \lambda_k|\} \le 4\sqrt{\log(n)\,\theta_{\max}\|\theta\|_1}$.

A direct result of Lemma 2.3 is that, with probability at least $1 + o(n^{-3})$,

(2.16)  $\hat{\lambda}_k \asymp \|\theta\|^2$, for all $1 \le k \le K$.

This result is frequently used in the proof section.

Next, we study the leading eigenvectors. From now on, we assume Conditions
(2.14)-(2.15) hold, and let $\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_K$ be the $K$ leading eigenvalues
as in Lemma 2.3. For $1 \le k \le K$, whenever $\hat{\lambda}_k$ is not an eigenvalue of
$W - \mathrm{diag}(\Omega)$, let $B^{(k)}$ be the $K \times K$ matrix

  $B^{(k)}(i,j) = (\|\theta^{(i)}\|\|\theta^{(j)}\|)^{-1} (\theta^{(i)})' [I_n - (W - \mathrm{diag}(\Omega))/\hat{\lambda}_k]^{-1} \theta^{(j)}, \qquad 1 \le i,j \le K$.

If $\hat{\lambda}_k$ is an eigenvalue of $W - \mathrm{diag}(\Omega)$, let $B^{(k)}$ be the $K \times K$ matrix of 0.

Lemma 2.4. Consider a DCBM where (2.11), (2.14), and (2.15) hold.
Let $\{\hat{\lambda}_k\}_{k=1}^{K}$ be the eigenvalues of $X$ with the largest magnitudes. There is
an event with probability at least $1 + o(n^{-3})$ such that over the event, for
each $1 \le k \le K$, $\hat{\lambda}_k$ is simple, and the associated eigenvector is given by

  $\hat{\eta}_k = \sum_{i=1}^{K} (\hat{a}_k(i)/\|\theta^{(i)}\|)\,[I_n - (W - \mathrm{diag}(\Omega))/\hat{\lambda}_k]^{-1} \theta^{(i)}$,

where $\hat{a}_k$ is a (unit-norm) eigenvector of $DADB^{(k)}$, and $\hat{\lambda}_k/\|\theta\|^2$ is the
unique eigenvalue of $DADB^{(k)}$ that is associated with $\hat{a}_k$.

We remark that $\hat{\eta}_k$ do not necessarily have unit norms, and they are
uniquely determined up to a scaling factor. Among them, $\hat{\eta}_1$ is particularly
interesting: provided that the network $N = (V, E)$ is connected,
all entries of $\hat{\eta}_1$ are strictly positive (or strictly negative). Also, the associated
eigenvalue $\hat{\lambda}_1$ is always strictly positive. These results are due to Perron's
powerful theorem [15, Page 508]; see Section 2.7 for more discussion.

2.5. Characterization of the matrix $\hat{R}^*$. We now characterize the matrix
$\hat{R}^*$, defined as in (2.4). Let $\eta_1, \eta_2, \ldots, \eta_K$ be the $K$ leading (unit-norm)
eigenvectors of $\Omega$ as in Lemma 2.1. Define an $n \times (K-1)$ matrix $R$ as a
non-stochastic counterpart of $\hat{R}^*$ by

  $R(i,k) = \eta_{k+1}(i)/\eta_1(i), \qquad 1 \le k \le K-1, \ 1 \le i \le n$;

note that $\|\eta_k\| = 1$. Unlike $\hat{R}$, $|R(i,k)| \le C$ for all $i$ and $k$ (see Lemma 2.1),
so it is unnecessary to impose thresholding as that in (2.4).
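The entry-wise ratio and its thresholded version can be sketched as follows. This is an illustration under assumptions: the function name `ratio_matrix` is hypothetical, the truncation rule (clip each entry to $[-T_n, T_n]$) is our reading of (2.4), which lies outside this section, and the two input vectors are toy values rather than actual eigenvectors.

```python
import math

def ratio_matrix(eigvecs, threshold=None):
    """Entry-wise ratios R(i, k) = eta_{k+1}(i) / eta_1(i), optionally truncated.

    eigvecs:   list of K vectors (each a list of length n), leading one first.
    threshold: if given, every entry is clipped to [-threshold, threshold].
               This clipping rule is an assumption about (2.4).
    """
    first = eigvecs[0]
    rows = []
    for i in range(len(first)):
        row = []
        for vec in eigvecs[1:]:
            r = vec[i] / first[i]
            if threshold is not None:
                r = max(-threshold, min(threshold, r))
            row.append(r)
        rows.append(row)
    return rows

# Toy vectors: the last coordinate of eta1 is tiny, so its ratio blows up.
eta1 = [0.5, 0.5, 0.5, 0.01]
eta2 = [0.5, -0.5, 0.25, 0.9]
n = len(eta1)
R_hat = ratio_matrix([eta1, eta2])                          # no thresholding
R_star = ratio_matrix([eta1, eta2], threshold=math.log(n))  # T_n = log(n)
```

Only the last row is affected by the truncation: its raw ratio is about 90, while the thresholded entry equals $\log(4)$.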

We wish to characterize $\|\hat{R}^* - R\|_F$, where $\|\cdot\|_F$ denotes the Frobenius
norm. To do so, we need to characterize $\|\hat{\eta}_k - \eta_k\|$ and $\|\Theta^{-1}(\hat{\eta}_k - \eta_k)\|$ (the
latter is necessary because in the definition of $\hat{R}(i,k)$, we have $\hat{\eta}_1(i)$ in the
denominator, which will be shown to be at the magnitude of $\theta(i)$; see Section
2.7 for details).

Towards this end, we need to put some conditions on $\Theta$ (or equivalently
on $\theta$), namely the Moderate Deviation conditions on Vectors (MDV) and
the Moderate Deviation conditions on Matrices (MDM), which are used to
control the moderate deviations of norms of vectors, say, $W\theta$, and norms of
matrices, say, $\Theta^{-1}W$, respectively. The MDV requires that
(2-17) 

E™ faf\ ^°&( n )Qmax \ s- n\\a\\ ™ / ^ ^°S( n )^max\ ^ C 

„ max{ * (,) ' ~mT ] - c¥h - § max % ' WP5F } - S WY 

for some constant C > 0, and the MDM requires that 

1 ( 1 1 ^ 1 

(2.18) i|Li< max{ -— Heili,^^ — }. 

The following short hand notation is used many times below. 

i=i 

The following theorem is the cornerstone for characterizing the behavior of
SCORE, and is proved in Section 4.

Theorem 2.1. Consider a DCBM where the three regularity conditions
(2.11), (2.14), and (2.15), and the two moderate deviation conditions (2.17)
and (2.18) hold. If $T_n = \log(n)$ is as in (2.5), then as $n \to \infty$, with probability
at least $1 + o(n^{-2})$, we have that

  $\|\hat{R}^* - R\|_F^2 \le C \log^3(n)\,err_n$.

For a general choice of $T_n$, the result continues to hold if we replace the
right hand side by $C \log(n) T_n^2\,err_n$.

2.6. Hamming errors of SCORE. Recall that $V = V^{(1)} \cup V^{(2)} \cup \ldots \cup V^{(K)}$
is the true community partition. Introduce the $n \times 1$ vector $\ell$ of true labels
such that

  $\ell(i) = k$ if and only if $i \in V^{(k)}, \qquad 1 \le i \le n$.

For any community detection procedure, there is a (disjoint) partition $V =
\hat{V}^{(1)} \cup \hat{V}^{(2)} \cup \ldots \cup \hat{V}^{(K)}$, so we can similarly define the $n \times 1$ vector of estimated
labels by

  $\hat{\ell}(i) = k$ if and only if $i \in \hat{V}^{(k)}, \qquad 1 \le i \le n$.

Especially, let $\hat{\ell}^{sc} = \hat{\ell}^{sc}(X, T_n, n)$ be the vector of estimated labels by SCORE.

For any $\hat{\ell}$, the expected number of mismatched labels is

  $H_p(\hat{\ell}, \ell) = \sum_{i=1}^{n} P(\hat{\ell}(i) \ne \ell(i))$.

That said, the clustering errors should not depend on how we label each
of the $K$ communities. Towards this end, let

(2.20)  $S_K = \{\pi : \pi$ is a permutation of the set $\{1, 2, \ldots, K\}\}$.

Also, for any label vector $\ell$ where the coordinates take value from $\{1, 2, \ldots, K\}$
and any $\pi \in S_K$, let $\pi(\ell)$ denote the $n \times 1$ label vector such that

  $\pi(\ell)(i) = \pi(\ell(i)), \qquad 1 \le i \le n$.

With these notations, a proper way to measure the performance of $\hat{\ell}$ is to
use the Hamming distance as follows:

  $\mathrm{Hamm}_n(\hat{\ell}, \ell) = \min_{\pi \in S_K} H_p(\hat{\ell}, \pi(\ell))$.
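For a single realized label vector, the minimization over relabelings can be sketched as below. Two caveats: the function name `hamming_error` and the example label vectors are hypothetical, and the code counts mismatches for one realization (the empirical counterpart of $\mathrm{Hamm}_n$) rather than taking the expectation that appears in $H_p$. The search visits all $K!$ permutations, which is only practical for small $K$.

```python
from itertools import permutations

def hamming_error(estimated, truth, num_communities):
    """Smallest number of mismatches over all relabelings of the K communities.

    Labels are integers in {1, ..., K}. The permutation pi is applied to the
    true labels, mirroring pi(l)(i) = pi(l(i)).
    """
    labels = range(1, num_communities + 1)
    best = len(truth)
    for perm in permutations(labels):
        relabel = dict(zip(labels, perm))  # pi: k -> pi(k)
        mismatches = sum(1 for e, t in zip(estimated, truth) if e != relabel[t])
        best = min(best, mismatches)
    return best

truth = [1, 1, 1, 2, 2, 2, 3, 3]
estimated = [2, 2, 2, 3, 3, 1, 1, 1]   # same grouping up to names, one node off
errors = hamming_error(estimated, truth, 3)
```

In this example the relabeling $1 \to 2$, $2 \to 3$, $3 \to 1$ matches all nodes except one, so `errors` is 1.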

For $k = 1, 2, \ldots, K$, let $n_k$ be the size of the $k$-th community:

  $n_k = |V^{(k)}|$.

The following theorem, which is proved in Section 4, says that SCORE is
consistent under mild conditions, and is the main result of the paper.

Theorem 2.2. Consider a DCBM where both the three regularity conditions
(2.11), (2.14), and (2.15) and the two moderate deviation conditions
(2.17) and (2.18) hold. Suppose as $n \to \infty$,

  $\log^3(n)\,err_n / \min\{n_1, n_2, \ldots, n_K\} \to 0$,

where $err_n$ is as in (2.19). For the estimated label vector $\hat{\ell}^{sc}$ by the SCORE
where the threshold $T_n = \log(n)$ is as in (2.5), there is a constant $C > 0$
such that for sufficiently large $n$,

  $\mathrm{Hamm}_n(\hat{\ell}^{sc}, \ell) \le C \log^3(n)\,err_n$.

Similarly, for general $T_n$, the theorem continues to hold if we replace the
right hand side by $C \log(n) T_n^2\,err_n$.

2.7. Stability of SCORE. The performance of SCORE hinges on the matrix
$\hat{R}$ defined in (2.4):

  $\hat{R}(i,k) = \hat{\eta}_{k+1}(i)/\hat{\eta}_1(i), \qquad 1 \le i \le n, \ 1 \le k \le K-1$.

Seemingly, SCORE could be unstable if the denominator $\hat{\eta}_1(i)$ is small (or
even worse, equal to 0) for some $i$. Fortunately, this is not the case, and under
mild conditions, for most $i$ (or for all $i$ with slightly stronger conditions),
$\hat{\eta}_1(i) \asymp \eta_1(i) \asymp \theta(i)$. Below, we further characterize the vector $(\hat{\eta}_k - \eta_k)$, with
emphasis on the case of $k = 1$.

We start with a lemma on $\hat{\eta}_1$, which says a coordinate of $\hat{\eta}_1$ can never be
exactly 0, as long as the network is connected.

Lemma 2.5. Let $X$ be the adjacency matrix of a network $N = (V, E)$, let
$\hat{\lambda}_1$ be the eigenvalue with the largest magnitude, and let $\hat{\eta}_1$ be the associated
eigenvector where at least one coordinate is positive. If $N$ is connected, then
both $\hat{\lambda}_1$ and all coordinates of $\hat{\eta}_1$ are strictly positive.

Lemma 2.5 is the direct result of Perron's theorem [15, Section 8.2] on
non-negative matrices, so we omit the proof.

Next, for any $n \times 1$ vector $\xi$ with strictly positive coordinates, define the
coordinate oscillation by

  $\mathrm{OSC}(\xi) = \max_{1 \le i,j \le n} \{\xi(i)/\xi(j)\}$.

The following lemma is proved in Section 7 (note that the $i$-th coordinate
of $\Theta^{-1}\eta_1$ is $\eta_1(i)/\theta(i)$).

Lemma 2.6. Consider a DCBM where (2.14)-(2.15) hold. We have

  $\mathrm{OSC}(\Theta^{-1}\eta_1) \le C$.

The following lemmas constitute the key component of the proof of Theorem
2.1, but can also be used to obtain upper bounds for the number of
"ill-behaved" coordinates of $\hat{\eta}_1$. These lemmas are proved in Section 7.

Lemma 2.7. Consider a DCBM where the conditions of Theorem 2.1
hold. With probability at least $1 + o(n^{-3})$, for all $1 \le k \le K$,

  $\|\hat{\eta}_k - \eta_k\|^2 \le C \log(n)\|\theta\|_1\|\theta\|_3^3 / \|\theta\|^6$.

Lemma 2.8. Consider a DCBM where the conditions of Theorem 2.1
hold. With probability at least $1 + o(n^{-3})$, for all $1 \le k \le K$,

  $\|\Theta^{-1}(\hat{\eta}_k - \eta_k)\|^2 \le C \log(n)\,err_n$.

Recall that $err_n$ is defined in (2.19). For Lemma 2.8, a weaker bound is
possible if we simply combine Lemma 2.7 and the fact that $\|\Theta^{-1}\| \le 1/\theta_{\min}$.
The current bound is much sharper, especially when only a few $\theta(i)$ are small.

We now obtain an upper bound on the number of "ill-behaved" entries of
$\hat{\eta}_1$. Recall that $\mathrm{OSC}(\Theta^{-1}\eta_1) \le C$. Fixing a constant $c_0 \in (0, 1)$, we call the
$i$-th entry of $\hat{\eta}_1$ well-behaved if $|\hat{\eta}_1(i)/\eta_1(i) - 1| \le c_0$ (say). Let

(2.21)  $S = S(c_0, \hat{\eta}_1, \eta_1; X, \Omega, n) = \{1 \le i \le n : |\hat{\eta}_1(i)/\eta_1(i) - 1| \le c_0\}$.

The following lemma is a direct result of Lemma 2.6 and Lemma 2.8, so we
omit the proof.

Lemma 2.9. Consider a DCBM where the conditions of Theorem 2.1
hold. Fix $c_0 \in (0, 1)$ and let $S$ be as in (2.21). Then with probability at least
$1 + o(n^{-3})$, $|V \setminus S| \le C \log(n)\,err_n$.

Therefore, as long as $\log(n)\,err_n/n \to 0$ when $n \to \infty$, the fraction of
"ill-behaved" coordinates of $\hat{\eta}_1$ tends to 0 and is negligible.

In principle, provided that some stronger conditions are imposed, the
techniques in this paper (especially those in the proof of Lemmas 2.7-2.8)
can be used to show that with probability at least $1 + o(n^{-3})$,

  $\max_{1 \le i \le n} |\hat{\eta}_1(i)/\eta_1(i) - 1| \le c_0$,

where $c_0 \in (0, 1)$ is a constant. In this case, Theorem 2.2 can be strengthened
to the statement that, with probability at least $1 + o(n^{-2})$,

  $\mathrm{Hamm}_n(\hat{\ell}^{sc}, \ell) = 0$.

Using terminology in the literature of variable selection [10], this says that
the SCORE has the oracle property, meaning that it achieves exact recovery
with overwhelming probability.

2.8. Remarks on normalized PCA (nPCA). We revisit the normalized
PCA (nPCA) discussed in Section 1.5. In the DCBM, one may expect that
the normalization in the nPCA reduces the effects of degree heterogeneity.
Somewhat surprisingly, this is not always the case. The point is that, when
we try to correct the degree heterogeneity, we tend to inflate the noise level
at the same time.

Perhaps the best way to appreciate this is to consider the "oracle" situation
where $\theta(i)$ are known. In this case, nPCA is almost equivalent to
applying the ordinary PCA (oPCA) to the matrix $\Theta^{-1/2} X \Theta^{-1/2}$. Write

  $\Theta^{-1/2} X \Theta^{-1/2} = \Theta^{-1/2}\Omega\Theta^{-1/2} + \Theta^{-1/2}[W - \mathrm{diag}(\Omega)]\Theta^{-1/2}$.

We call the ratio between the smallest eigenvalue of $\Theta^{-1/2}\Omega\Theta^{-1/2}$ (in magnitude)
and the spectral norm of $\Theta^{-1/2}[W - \mathrm{diag}(\Omega)]\Theta^{-1/2}$ the Signal Noise
Ratio for nPCA, denoted by nSNR.

Similarly, we use oSNR to denote the Signal Noise Ratio in the case
without normalization (i.e., the ratio between the smallest nonzero eigenvalue of $\Omega$
(in magnitude) and the spectral norm of $W - \mathrm{diag}(\Omega)$). If the normalization
helps, then we should have nSNR $\gg$ oSNR.

Now, first, by Lemmas 2.1-2.2, we can roughly say that

(2.22)  oSNR $\asymp \|\theta\|^2 / \sqrt{\log(n)\,\theta_{\max}\|\theta\|_1}$.

Additionally, we have the following observations.

• Write $\Theta^{-1/2}\Omega\Theta^{-1/2} = \Theta^{1/2}[\sum_{i,j=1}^{K} A(i,j) 1_i 1_j']\Theta^{1/2}$. A direct extension
  of Lemma 2.1 is that, if $\lambda_k$ is a nonzero eigenvalue of $\Theta^{-1/2}\Omega\Theta^{-1/2}$,
  then $\lambda_k \asymp \|\theta\|_1$, $1 \le k \le K$.

• A direct extension of Lemma 2.2 is that $\|\Theta^{-1/2}[W - \mathrm{diag}(\Omega)]\Theta^{-1/2}\| \le
  C\sqrt{\log(n)\,n}$, with probability at least $1 + o(n^{-3})$.

Therefore, roughly speaking,

(2.23)  nSNR $\asymp \|\theta\|_1 / \sqrt{\log(n)\,n}$.

Comparing the right hand sides of (2.22) and (2.23), in order for nSNR $\gg$
oSNR, we need

  $\|\theta\|_1 / \sqrt{\log(n)\,n} \gg \|\theta\|^2 / \sqrt{\log(n)\,\theta_{\max}\|\theta\|_1}$, that is, $\theta_{\max}\|\theta\|_1^3 \gg n\|\theta\|^4$.

Unfortunately, $\theta_{\max}\|\theta\|_1^3 \ll n\|\theta\|^4$ in many situations. Therefore, the
normalization in nPCA does not always help, even in such an oracle situation.
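The comparison can be made tangible by evaluating the right hand sides of (2.22) and (2.23) for specific vectors $\theta$. The sketch below does this for a constant $\theta$ and for a two-level $\theta$ of the kind used in Experiment 4c; the function names are hypothetical, and the values are order-of-magnitude proxies (constants are ignored), not actual signal-to-noise ratios.

```python
import math

def rough_osnr(theta):
    """Right hand side of (2.22): ||theta||^2 / sqrt(log(n) * theta_max * ||theta||_1)."""
    n = len(theta)
    l2_squared = sum(t * t for t in theta)
    l1 = sum(theta)
    return l2_squared / math.sqrt(math.log(n) * max(theta) * l1)

def rough_nsnr(theta):
    """Right hand side of (2.23): ||theta||_1 / sqrt(log(n) * n)."""
    n = len(theta)
    return sum(theta) / math.sqrt(math.log(n) * n)

n = 1000
flat = [0.2] * n                                   # no degree heterogeneity
two_level = [0.5] * (n // 2) + [0.02] * (n // 2)   # half large, half small degrees

ratio_flat = rough_nsnr(flat) / rough_osnr(flat)
ratio_two_level = rough_nsnr(two_level) / rough_osnr(two_level)
```

With a constant $\theta$ the two proxies coincide, so the normalization gains nothing. With the two-level $\theta$ the ratio is roughly 0.75, meaning the nPCA proxy is the smaller of the two.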

3. Variants of SCORE. The key idea underlying the SCORE is that,
in a broad context, the $K$ leading eigenvectors of the adjacency matrix $X$
approximate those of the non-stochastic matrix $\Omega$, where the latter are

  $\Theta\bigl(\sum_{\ell=1}^{K} [a_k(\ell)/\|\theta^{(\ell)}\|]\,1_\ell\bigr), \qquad k = 1, 2, \ldots, K$.

                          SCORE    SCOREq (q = 1)    SCOREq (q = 2)
  web blogs (n = 1222)      58           61                64
  karate (n = 34)            1            1                 1

Table 2
Comparison of number of errors for the SCORE and the SCOREq.

It is seen that

• The information of the community labels is contained in the term
  within the bracket, which depends on $\{\theta(i)\}_{i=1}^{n}$ only through the
  overall degree intensities $\|\theta^{(\ell)}\|$.

• The diagonal matrix $\Theta$ does not contain any information of the
  community labels.

• Therefore, $\{\theta(i)\}_{i=1}^{n}$ are almost nuisance parameters, the effect of which
  can be removed by many scaling invariant mappings, to be introduced
  below.

Definition 3.1. Let $W \subset R^K$ be a subset such that when $x \in W$, then
$ax \in W$ for any $a > 0$. We call a mapping $M$ from $W$ to $R^K$ scaling
invariant if $M(ax) = M(x)$ for any $a > 0$ and $x \in W$.

The following are some examples of scaling invariant mappings.

• (a). $W = \{x \in R^K, x(1) \ne 0\}$, and $Mx = x/x(1)$; $x(1)$ is the first
  coordinate of $x$.

• (b). $W = R^K \setminus \{0\}$, $Mx = x/\|x\|_q$, where $q > 0$ is a constant.
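A small numerical check of Definition 3.1 for the two example mappings is sketched below. The function names and the test vector are hypothetical; the scaling factor 8 is chosen as a power of two so that mapping (a) agrees exactly in floating point, while mapping (b) is compared up to rounding.

```python
def map_first_coordinate(x):
    """Mapping (a): x -> x / x(1); requires x(1) != 0."""
    return [v / x[0] for v in x]

def map_lq_norm(x, q):
    """Mapping (b): x -> x / ||x||_q for a constant q > 0; requires x != 0."""
    norm = sum(abs(v) ** q for v in x) ** (1.0 / q)
    return [v / norm for v in x]

x = [2.0, -1.0, 4.0]
scaled = [8.0 * v for v in x]   # a = 8 > 0

# Scaling invariance: M(ax) should equal M(x).
same_a = map_first_coordinate(x) == map_first_coordinate(scaled)
close_b = all(
    abs(u - v) < 1e-12
    for u, v in zip(map_lq_norm(x, 2), map_lq_norm(scaled, 2))
)
```

Mapping (a) always returns 1 in its first coordinate, which is why that column carries no information for clustering.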

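Given a scaling invariant mapping $M$, we have the following extension of
SCORE.

• Obtain the $K$ leading (unit-norm) eigenvectors $\hat{\eta}_1, \hat{\eta}_2, \ldots, \hat{\eta}_K$ of $X$. Arrange them in
  an $n \times K$ matrix $\hat{R}$ as follows so that $\hat{\xi}_i'$ is the $i$-th row of $\hat{R}$, $1 \le i \le n$:

    $\hat{R} = [\hat{\eta}_1, \hat{\eta}_2, \ldots, \hat{\eta}_K] = [\hat{\xi}_1, \hat{\xi}_2, \ldots, \hat{\xi}_n]'$.

• Obtain an $n \times K$ matrix $\hat{R}^*$ where the $i$-th row of $\hat{R}^*$ is $(M\hat{\xi}_i)'$.

• Apply the k-means method to the matrix $\hat{R}^*$ for clustering with $\le K$
  classes.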
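The steps above can be sketched for $K = 2$ on a toy network. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the eigenvectors are computed by power iteration with deflation (a swapped-in technique chosen to keep the example dependency-free), the mapping is (a) with the constant first column dropped, k-means is a plain Lloyd iteration on the real line, and all function names and the toy graph are hypothetical.

```python
import math

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def leading_eigenpair(m, iterations=1000):
    """Power iteration for the eigenvalue of largest magnitude of a symmetric matrix."""
    v = normalize([float(i + 1) for i in range(len(m))])
    for _ in range(iterations):
        v = normalize(mat_vec(m, v))
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))  # Rayleigh quotient
    return eigenvalue, v

def two_leading_eigenvectors(x):
    """Leading eigenvector, then the next one after deflating the first."""
    lam1, v1 = leading_eigenpair(x)
    n = len(x)
    deflated = [[x[i][j] - lam1 * v1[i] * v1[j] for j in range(n)] for i in range(n)]
    _, v2 = leading_eigenpair(deflated)
    return v1, v2

def two_means_1d(values, iterations=100):
    """Lloyd's algorithm with two centers on the real line."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values]
        for c in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def score_two_communities(x):
    v1, v2 = two_leading_eigenvectors(x)
    ratios = [b / a for a, b in zip(v1, v2)]  # entry-wise ratio of eigenvectors
    return two_means_1d(ratios)

# Toy network: two 5-node cliques joined by the single edge (4, 5).
n = 10
X = [[0.0] * n for _ in range(n)]
for block in (range(0, 5), range(5, 10)):
    for i in block:
        for j in block:
            if i != j:
                X[i][j] = 1.0
X[4][5] = X[5][4] = 1.0
labels = score_two_communities(X)
```

On this graph the second eigenvector changes sign between the two cliques while the first is strictly positive, so the ratios split by sign and the two cliques are recovered (up to the naming of the labels).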

For example, if we view each row of $\hat{R}$ as a point in $R^K$ and apply $M$ in
(a), then we have the $\hat{R}$ matrix in (2.4) associated with the original SCORE
(except that in (2.4), the first column is removed because it is the vector of 1's
and is thus non-informative for clustering).

For another example, we take $M$ as the mapping in (b), and call the
resultant procedure SCOREq, where $q$ is the parameter in the mapping

  $x \mapsto x/\|x\|_q$.

We have investigated the performances of SCORE and SCOREq with the
karate club data and the web blogs data, where we pick $q = 1$ and $q = 2$.
The performances are largely similar, but SCORE is slightly better for the
web blogs data. See Table 2 for details. In Section 5, we further investigate
these methods on simulated data.

A natural question is what could be the best M. In principle, this can be 
studied using the techniques developed in this paper, but the study involves 
higher order asymptotics and would be rather tedious. For this reason, we 
leave it for the future work. 

4. Proof of the main theorems. In this section, we prove Theorems
2.1-2.2. The key for the proof is Lemmas 2.7-2.8, which contain bounds on
$\|\hat{\eta}_k - \eta_k\|$ and $\|\Theta^{-1}(\hat{\eta}_k - \eta_k)\|$, respectively. To show Lemmas 2.7-2.8, we
need tight moderate deviation bounds on matrices and vectors involving the
noise matrix $W$. Below, we first describe such moderate deviation bounds,
and then give the proofs for Theorems 2.1-2.2. The proofs of Lemmas 2.7-2.8
are given in Section 7.

4.1. Moderate deviation inequalities on vectors and matrices. The following
theorem is proved in [27], and is the extension of the well-known
Bernstein's inequality from the case of random variables to the case of random
matrices. Recall that $\|\cdot\|$ denotes the spectral norm.

Theorem 4.1. Consider a finite sequence $\{Z_k\}$ of independent, random,
symmetric (real-valued) $n \times p$ matrices. Assume that each random matrix
satisfies $E[Z_k] = 0$ and $\|Z_k\| \le h_0$ almost surely. Then for all $t \ge 0$,

  $P\bigl(\|\sum_k Z_k\| \ge t\bigr) \le (n + p)\exp\Bigl(-\frac{t^2/2}{\sigma^2 + h_0 t/3}\Bigr)$,

where $\sigma^2 = \max\{\|\sum_k E[Z_k Z_k']\|, \|\sum_k E[Z_k' Z_k]\|\}$.

The following lemma provides moderate deviation bounds on $\|\Theta^{-1}W\|$.

Lemma 4.1. If the Moderate Deviation condition on Matrices (2.18)
holds, then $\|\Theta^{-1}W\|^2 \le C \log(n) \max\{\theta_{\min}^{-1}\|\theta\|_1, \ \theta_{\max}\sum_{i=1}^{n} \theta(i)^{-1}\}$, with probability
at least $1 + o(n^{-3})$.

The following lemma provides moderate deviation bounds on the norms
of various vectors. The lemma is proved in Section 7, where the classical
Bennett's inequality is the key [26] (recall that $\theta^{(k)}$ is defined in (2.10)).

Lemma 4.2. If the Moderate Deviation condition on Vectors (2.17) holds,
then with probability at least $1 + o(n^{-3})$, for all $1 \le k, \ell \le K$,

• $\|W\theta^{(k)}\|^2 \le C \log(n)\|\theta\|_1\|\theta\|_3^3$.

• $\|\Theta^{-1}W\theta^{(k)}\|^2 \le C \log(n)\|\theta\|_3^3 \sum_{i=1}^{n} \theta(i)^{-1}$.

• \(eW)'W0W\ 2 < Clog(n)[||#||^ + log(n)^ 



We are now ready to show the two main theorems. 

4.2. Proof of Theorem 2.1. Let $\{\eta_k\}_{k=1}^{K}$ and $\{\hat{\eta}_k\}_{k=1}^{K}$ be as in Lemma 2.1
and Lemma 2.4, respectively. Let $S$ be the set of "well-behaved" nodes as in
(2.21), where $c_0 = 1/2$ for simplicity. By Lemmas 2.7-2.9, there is an event
$E_n$ such that $P(E_n^c) = o(n^{-3})$ and that over $E_n$, for all $1 \le k \le K$,

(4.24)  $\|\hat{\eta}_k - \eta_k\|^2 \le C \log(n)\|\theta\|_1\|\theta\|_3^3/\|\theta\|^6, \qquad \|\Theta^{-1}(\hat{\eta}_k - \eta_k)\|^2 \le C \log(n)\,err_n$,

and

(4.25)  $|V \setminus S| \le C \log(n)\,err_n$.

Especially, combining (4.24) and (2.12), $\|\hat{\eta}_k\| \sim \|\eta_k\| = 1$. To show the claim,
it is sufficient to show that over the event $E_n$, $\|\hat{R}^* - R\|_F^2 \le C \log^3(n)\,err_n$.
To this end, we write

  $\|\hat{R}^* - R\|_F^2 = U_1 + U_2$,

where $U_1$ is the sum of squares of the $\ell^2$-norms of all "ill-behaved" rows of
$\hat{R}^* - R$, and $U_2$ is that of all "well-behaved" rows.

Consider $U_1$. For any $i \notin S$ and $1 \le k \le K-1$, it is seen that $|\hat{R}^*(i,k)| \le
T_n$ and $|R(i,k)| = |\eta_{k+1}(i)/\eta_1(i)| \le C$, where $T_n = \log(n)$ and we have used
Lemma 2.1 and (2.15). Combining these with Lemma 2.9 and (2.14),

(4.26)  $U_1 \le C \log^2(n)|V \setminus S| \le C \log^3(n)\,err_n$.

Consider $U_2$. Recall that for any $i \in S$,

(4.27)  $|\hat{\eta}_1(i)/\eta_1(i) - 1| \le 1/2$.

Since $|R(i,k)| \le C$, $|\hat{R}^*(i,k) - R(i,k)| \le |\hat{R}(i,k) - R(i,k)|$. Write

(4.28)  $\hat{R}(i,k) - R(i,k) = \frac{\|\hat{\eta}_1\|}{\|\hat{\eta}_{k+1}\|}\cdot\frac{\hat{\eta}_{k+1}(i)}{\hat{\eta}_1(i)} - \frac{\eta_{k+1}(i)}{\eta_1(i)} = \frac{\|\hat{\eta}_1\|}{\|\hat{\eta}_{k+1}\|}(I + II + III)$,

where

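  $I = (\hat{\eta}_{k+1}(i) - \eta_{k+1}(i))/\eta_1(i), \qquad II = \hat{\eta}_{k+1}(i)(\eta_1(i) - \hat{\eta}_1(i))/(\hat{\eta}_1(i)\eta_1(i))$,

and

  $III = (1 - \|\hat{\eta}_{k+1}\|/\|\hat{\eta}_1\|)\,\eta_{k+1}(i)/\eta_1(i)$.

Recall that $\|\hat{\eta}_k\| \sim 1$ for all $1 \le k \le K$.
Now, first, by Lemma 2.6,

(4.29)  $|I| \le C|\hat{\eta}_{k+1}(i) - \eta_{k+1}(i)|/\theta(i)$.

Second, write $II = IIa + IIb$, where

  $IIa = \frac{\eta_{k+1}(i)(\eta_1(i) - \hat{\eta}_1(i))}{\hat{\eta}_1(i)\eta_1(i)}, \qquad IIb = \frac{[\hat{\eta}_{k+1}(i) - \eta_{k+1}(i)][\eta_1(i) - \hat{\eta}_1(i)]}{\hat{\eta}_1(i)\eta_1(i)}$.

By Lemma 2.1, Lemma 2.6, and (4.27), $|\eta_{k+1}(i)/[\hat{\eta}_1(i)\eta_1(i)]| \le C/\theta(i)$ and
$|[\hat{\eta}_1(i) - \eta_1(i)]/[\hat{\eta}_1(i)\eta_1(i)]| \le C/\theta(i)$. Therefore, $|IIa| \le C|\hat{\eta}_1(i) - \eta_1(i)|/\theta(i)$
and $|IIb| \le C|\hat{\eta}_{k+1}(i) - \eta_{k+1}(i)|/\theta(i)$. Combining these gives

(4.30)  $|II| \le C[|\hat{\eta}_1(i) - \eta_1(i)| + |\hat{\eta}_{k+1}(i) - \eta_{k+1}(i)|]/\theta(i)$.

Third, recalling $\|\eta_k\| = 1$ and $\|\hat{\eta}_k\| \sim 1$ for all $1 \le k \le K$ and using the triangle
inequality,

  $|1 - \|\hat{\eta}_{k+1}\|/\|\hat{\eta}_1\|| \le C\,|\|\hat{\eta}_1\| - \|\hat{\eta}_{k+1}\|| \le C\bigl(|\|\hat{\eta}_1\| - \|\eta_1\|| + |\|\hat{\eta}_{k+1}\| - \|\eta_{k+1}\||\bigr)$,

where the right hand side does not exceed $C(\|\hat{\eta}_1 - \eta_1\| + \|\hat{\eta}_{k+1} - \eta_{k+1}\|)$. At the same
time, recall that $|\eta_{k+1}(i)/\eta_1(i)| \le C$. Combining these gives

(4.31)  $|III| \le C\,|\|\hat{\eta}_1\| - \|\hat{\eta}_{k+1}\|| \le C[\|\hat{\eta}_1 - \eta_1\| + \|\hat{\eta}_{k+1} - \eta_{k+1}\|]$.

Inserting (4.29)-(4.31) into (4.28), $|\hat{R}(i,k) - R(i,k)|$ does not exceed

  $C\Bigl(\frac{1}{\theta(i)}\bigl[|\hat{\eta}_1(i) - \eta_1(i)| + |\hat{\eta}_{k+1}(i) - \eta_{k+1}(i)|\bigr] + \|\hat{\eta}_1 - \eta_1\| + \|\hat{\eta}_{k+1} - \eta_{k+1}\|\Bigr)$.

Therefore, over the event $E_n$,

  $U_2 \le C \sum_{k=1}^{K} \bigl(\|\Theta^{-1}(\hat{\eta}_k - \eta_k)\|^2 + n\|\hat{\eta}_k - \eta_k\|^2\bigr)$.

Combining this with (4.24) gives

(4.32)  $U_2 \le C \log(n)\,[err_n + n\|\theta\|_1\|\theta\|_3^3/\|\theta\|^6]$.

Note that $n\|\theta\|_1/\|\theta\|^2 \le \sum_{i=1}^{n} (1/\theta(i))$. Therefore, $n\|\theta\|_1\|\theta\|_3^3/\|\theta\|^6 \le err_n$
by definitions. Combining this with (4.26) and (4.32) gives the claim. □

4.3. Proof of Theorem 2.2. Without loss of generality, assume $K > 2$.
The proof for the case $K = 2$ is the same, except that $\hat{R}^*$, $R$, and $M^*$
are vectors rather than matrices, so that we have to change the terminology
slightly. The following lemma is proved in Section 7.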

Lemma 4.3. The $n \times (K-1)$ matrix $R$ has exactly $K$ distinct rows, and
the $\ell^2$-distance between any two distinct rows is no smaller than $\sqrt{2}$.

We now show Theorem 2.2. For $1 \le i \le n$, let $\hat{r}_i^*$, $r_i$, and $m_i$ denote the
$i$-th row of $\hat{R}^*$, $R$, and $M^*$ correspondingly. Fixing $\delta = \sqrt{2}/3$, we introduce
a subset of $V$ by $W = \{1 \le i \le n : \|\hat{r}_i^* - r_i\| \le \delta, \ \|r_i - m_i\| \le \delta\}$. Recalling
that $V$ partitions to $K$ communities $V^{(1)}, \ldots, V^{(K)}$, we note that $W$
has a similar partition $W = W^{(1)} \cup W^{(2)} \cup \ldots \cup W^{(K)}$, where $W^{(k)} = V^{(k)} \cap W$,
$1 \le k \le K$.

By Theorem 2.1, there is an event $B$ such that $P(B^c) = o(n^{-2})$, and over
the event $B$,

(4.33)  $\|\hat{R}^* - R\|_F^2 \le C \log^3(n)\,err_n$.

Note that the $n \times (K-1)$ matrix $R$ has exactly $K$ unique rows so $R \in
M_{n,K-1,K}$. By how the k-means procedure is constructed, $\|\hat{R}^* - M^*\|_F \le
\|\hat{R}^* - R\|_F$, and so $\|R - M^*\|_F \le \|\hat{R}^* - R\|_F + \|\hat{R}^* - M^*\|_F \le 2\|\hat{R}^* - R\|_F$.
Combining this with (4.33),

(4.34)  $\|M^* - R\|_F^2 \le 4\|\hat{R}^* - R\|_F^2 \le C \log^3(n)\,err_n$.

Combining this with (4.33), it follows from the definition of $W$ that $|V \setminus W| \le
C \log^3(n)\,err_n$. Comparing this with the desired claim, it is sufficient to show
that all nodes in $W$ are correctly labeled; equivalently, this is to show that
for any $i, j \in W$ such that $i \in W^{(k)}$ and $j \in W^{(\ell)}$,

(4.35)  $m_i = m_j$ if and only if $k = \ell$.

We now show (4.35). First, by definitions and (4.33)-(4.34), the cardinality
of $(V^{(k)} \setminus W^{(k)})$ does not exceed $\delta^{-2} \sum_{i \in V^{(k)}} (\|\hat{r}_i^* - r_i\|^2 + \|m_i - r_i\|^2) \le
C \log^3(n)\,err_n$. Recall that we assume $\log^3(n)\,err_n \ll \min\{n_1, n_2, \ldots, n_K\}$.
Combining these, $W^{(k)}$ is non-empty. Second, by Lemma 4.3 and definitions,
for any $i, j \in W$ such that $i \in W^{(k)}$, $j \in W^{(\ell)}$, where $1 \le k, \ell \le K$ and
$k \ne \ell$,

(4.36)  $\|m_i - m_j\| \ge \|r_i - r_j\| - (\|m_i - r_i\| + \|m_j - r_j\|) \ge \sqrt{2} - 2\delta = \delta$.

Therefore, if $m_i = m_j$ for some $i, j \in W$, then there is a $1 \le k \le K$ such that
$i, j \in W^{(k)}$. Suppose we pick one node $j_k$ from each $W^{(k)}$, $1 \le k \le K$. By
(4.36), the $K$ row vectors $\{m_{j_1}, \ldots, m_{j_K}\}$ are distinct. Note that the matrix
$M^*$ has at most $K$ distinct rows, so if $i, j \in W^{(k)}$ for some $1 \le k \le K$, then
$m_i = m_j$. Combining these gives (4.35). □

5. Simulations. We have conducted a small-scale simulation study. 
The goal is to select a few representative cases to investigate the perfor- 
mance of SCORE, two PCA approaches, and the modularity methods. 

We compare 6 different algorithms, including three variants of SCORE, 
two PCA approaches, and one modularity method. The three variants of the 
SCORE are the original SCORE (SCORE), SCOREq with q = 1 (SCORE1), 
and SCOREq with q = 2 (SCORE2). The two PCA approaches are the 
ordinary PCA (oPCA) and the normalized PCA (nPCA). For modularity 
method, we use the tabu algorithm (tabu). Note that SCORE, oPCA, nPCA 
and tabu are studied in Section 1. 

For each simulation experiment, we fix integers $(n, K)$ and let $m = n/K$;
for simplicity, we pick $(n, K)$ such that $m$ is also an integer. Fix a $K \times K$
matrix $A$ and an $n \times 1$ positive vector $\theta$. For $k = 1, 2, \ldots, K$, we let $V^{(k)} =
\{1 + m(k-1), 2 + m(k-1), \ldots, mk\}$, and let $1_k$ be the indicator vector of
$V^{(k)}$ as before. Each simulation experiment contains the following steps.

• (a). Randomly permute the coordinates of the vector fixed above, and denote
  the resultant vector by $\theta$. Let $\Theta$ be the $n \times n$ diagonal matrix such that
  $\Theta(i,i) = \theta(i)$, $1 \le i \le n$. Define the $n \times n$ matrix $\Omega$ by
  $\Omega = \Theta[\sum_{k,\ell=1}^{K} A(k,\ell) 1_k 1_\ell']\Theta$.

• (b). Generate a symmetric $n \times n$ matrix $W$ where all diagonals are 0,
  and for all $1 \le i < j \le n$, $W(i,j)$ are independent centered-Bernoulli
  with parameters $\Omega(i,j)$. Let $X = \Omega - \mathrm{diag}(\Omega) + W$, which can be
  viewed as the adjacency matrix of a network, say, $N = (V, E)$.

• (c). Remove all nodes that are not connected to any other nodes in
  $N$, and denote the resultant network by $N_0 = (V_0, E_0)$. Let $X$ be the
  adjacency matrix of $N_0$, and let $n_0$ be the size of $N_0$.

• (d). Apply all or a subset of the 6 aforementioned algorithms to $X$.
  Record the Hamming errors of all methods under investigation.

• (e). Repeat (b)-(d) for rep times, where rep is a preselected integer.
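Steps (a)-(c) can be sketched as follows. This is a minimal illustration, not the code used for the reported experiments: the function names are hypothetical, the network size is reduced to $n = 60$ to keep the pure-Python loops fast, and the seed is fixed only for reproducibility. Since $X = \Omega - \mathrm{diag}(\Omega) + W$ with centered-Bernoulli noise, each off-diagonal $X(i,j)$ is drawn directly as a Bernoulli variable with success probability $\Omega(i,j)$.

```python
import random

def simulate_dcbm(theta, A, K, rng):
    """One draw of steps (a)-(b): permute theta, build Omega, sample the adjacency matrix."""
    n = len(theta)
    m = n // K                                  # community size; n assumed divisible by K
    theta = list(theta)
    rng.shuffle(theta)                          # step (a): random permutation of coordinates
    community = [i // m for i in range(n)]      # 0-based community label of node i
    omega = [[theta[i] * A[community[i]][community[j]] * theta[j] for j in range(n)]
             for i in range(n)]
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):               # step (b): independent edges, no self-loops
            if rng.random() < omega[i][j]:
                X[i][j] = X[j][i] = 1
    return X, omega, community

def drop_isolated(X):
    """Step (c): remove nodes without edges; return the reduced matrix and kept indices."""
    keep = [i for i, row in enumerate(X) if any(row)]
    return [[X[i][j] for j in keep] for i in keep], keep

rng = random.Random(0)
n, K = 60, 2
A = [[1.0, 0.5], [0.5, 1.0]]
# Linearly increasing degree parameters between 0.02 and 0.5, in the spirit of Experiment 4a.
theta = [0.02 + (0.5 - 0.02) * (i + 1) / n for i in range(n)]
X, omega, community = simulate_dcbm(theta, A, K, rng)
X0, kept = drop_isolated(X)
```

Any of the clustering methods in step (d) would then be applied to `X0`, with errors measured against the labels of the retained nodes, `[community[i] for i in kept]`.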

In our study, $n_0$ is usually very close to $n$ so we do not report the exact
values. Also, we set the threshold $T_n$ in (2.4) as $\infty$ so that we do not truncate
any coordinates of $\hat{R}$, as usually none of them is unduly large; setting $T_n =
\log(n)$ gives almost the same results. The simulation includes 4 different
experiments, which we now describe separately.

Experiment 1. In this experiment, we investigate how SCORE, oPCA,
nPCA, and tabu perform with the classical stochastic Block Model (BM).
We choose $(n, K, rep) = (1000, 2, 50)$, $A$ as the $2 \times 2$ matrix with 1 on
the diagonals and 0.5 on the off-diagonals, and $\theta$ as the vector where all
coordinates are 0.2. This is a relatively easy case and all methods perform
satisfactorily and have similar error rates. See Table 3 for the results.

  Methods      oPCA          nPCA          tabu          SCORE
  Mean (SD)    .058 (.009)   .055 (.010)   .050 (.065)   .058 (.009)

Table 3
Comparison of mean error rates (Experiment 1). In each cell, the number in the bracket
is the corresponding standard deviation (SD); same below.

It is noteworthy that in one of the repetitions, tabu fails to converge
and has an error rate of 49.8%. Such outlying cases are observed in most
experiments below; sometimes the fraction of outlying cases is much larger.

Experiment 2. In this experiment, we compare SCORE, SCORE1, and
SCORE2. The experiment contains four sub-experiments, Experiments 2a-2d,
with different network sizes, degrees of heterogeneity, numbers of
communities, etc.

In Experiment 2a, we investigate with a BM model, where all coordinates
of $\theta$ are 0.1. Additionally, we take $(n, K, rep) = (2000, 2, 100)$, and $A$ as
the $2 \times 2$ matrix where the two diagonals are 1 and the two off-diagonals are 0.4. In
Experiment 2b, we take $(n, K, rep) = (800, 2, 100)$, and $A$ as the $2 \times 2$ matrix
where the diagonals are 1 and the two off-diagonals are .5. Also, we take $\theta$ to
be the vector where $\theta(i) = .025 + .475 \times (i/n)$. In Experiment 2c, we take
$(n, K, rep) = (1200, 2, 100)$ and let $A$ be the same as in Experiment 2b, but
we take $\theta$ to be the vector where $\theta(i) = .025 + .475 \times (i/n)^2$. In Experiment
2d, we take $(n, K, rep) = (1500, 3, 100)$, and $A$ as the $3 \times 3$ matrix where the
diagonals are 1, $A(1,2) = .4$, $A(2,3) = .4$, and $A(1,3) = .05$. Also, we take
$\theta(i) = .015 + .785 \times (i/n)^2$. The results are reported in Table 4, which
suggests that the three versions of SCORE behave similarly in various
settings, except that in the last case, SCORE slightly underperforms
SCORE1 and SCORE2.

  Experiment   Methods      SCORE         SCORE1        SCORE2
  2a           Mean (SD)    .107 (.01)    .107 (.01)    .107 (.01)
  2b           Mean (SD)    .054 (.008)   .054 (.008)   .054 (.008)
  2c           Mean (SD)    .112 (.009)   .111 (.008)   .111 (.008)
  2d           Mean (SD)    .122 (.119)   .071 (.006)   .071 (.006)

Table 4
Comparison of mean error rates (Experiment 2; from top to bottom: 2a-2d).

Experiment 3. In this example, we compare the performance of oPCA,
nPCA, tabu, and SCORE2 in the case where we have three communities. We
take $(n, K) = (1500, 3)$ and so $m = n/K = 500$. We take rep = 25 instead
of rep = 50 as the tabu algorithm is rather time consuming. We take $A$ as
the $3 \times 3$ symmetric matrix where we have 1 on the diagonals, $A(1,2) =
0.4$, $A(2,3) = 0.4$, and $A(1,3) = 0.05$. We take $\theta$ as the vector such that
$\theta(i) = .015 + .785 \times (i/n)^2$, $1 \le i \le n$. The results are reported in Table 5,
which suggest that SCORE2 outperforms nPCA and oPCA in terms of error
rates. The error rates of SCORE2 and the tabu are similar, but SCORE2 is
considerably more stable than the tabu.

  Methods      oPCA          nPCA          tabu           SCORE2
  Mean (SD)    .378 (.041)   .165 (.084)   .0636 (.123)   .0695 (.004)

Table 5
Comparison of mean error rates (Experiment 3).

Experiment 4. In this experiment, we investigate how the heterogeneity
parameters affect the performance of oPCA, nPCA, tabu and SCORE. We
take $(n, K, rep) = (1000, 2, 50)$, and $A$ to be the $2 \times 2$ matrix that has
1 on the diagonals and 0.5 on the off-diagonals. Fix $c_0 = 0.5$ and $d_0 =
0.02$. The experiment contains three sub-experiments, Experiments 4a-4c. In
Experiment 4a, we take $\theta$ to be the vector such that $\theta(i) = d_0 + (c_0 - d_0)(i/n)$,
$1 \le i \le n$. In Experiment 4b, we take $\theta$ to be the vector such that $\theta(i) =
d_0 + (c_0 - d_0)(i/n)^2$, $1 \le i \le n$. In Experiment 4c, we take $\theta$ to be the vector
such that $\theta(i) = c_0$ for $1 \le i \le n/2$ and $\theta(i) = 0.02$ for $n/2 < i \le n$. Note
that the heterogeneity effects are mild in Experiment 4a, but are much more
severe in Experiments 4b-4c.

The results are tabulated in Table 6. The error rates of oPCA and nPCA
are usually higher than those of the tabu and the SCORE. The average
error rates of the tabu and the SCORE are similar, but the tabu usually
has a much larger standard deviation (sometimes ten times as large). The
instability of the tabu algorithm arises because it depends on an initial guess
(generated randomly), and when the initial guess is "bad", the tabu may
fail to converge to the true labels.

  Experiment   Methods      oPCA          nPCA          tabu          SCORE
  4a           Mean (SD)    .066 (.021)   .066 (.107)   .042 (.064)   .043 (.006)
  4b           Mean (SD)    .292 (.014)   .431 (.122)   .138 (.080)   .140 (.010)
  4c           Mean (SD)    .254 (.034)   .476 (.049)   .139 (.074)   .130 (.010)

Table 6
Comparison of error rates (Experiment 4; from top to bottom: 4a-4c).

In conclusion, the SCORE methods (including the original SCORE and
two variants) have error rates much smaller than those of the two PCA
approaches. The error rates of the SCORE and the tabu are usually comparable
on average, but the standard deviation of the latter is usually a few times
larger, showing that the tabu is comparably less stable. Additionally, the
computation time of the tabu is much longer than that of the SCORE in
matlab code. Therefore, it is fair to say that the SCORE outperforms both
the two PCA approaches and the tabu algorithm.

6. Discussion. We propose SCORE as a novel spectral approach to
community detection with a DCBM. The method is motivated by the
observation that the degree heterogeneity parameters of the DCBM are
largely ancillary. If we obtain the first $K$ leading eigenvectors of the
adjacency matrix and arrange them in an $n \times K$ matrix $R$, then the
heterogeneity can be largely removed by applying a scaling-invariant
mapping to each row of $R$. SCORE is one such method.
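The row-wise ratio mapping can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the implementation used for the experiments: the helper names `score_ratios` and `simple_kmeans` are hypothetical, the clustering step is a plain Lloyd iteration with a deterministic farthest-point start standing in for a generic k-means routine, and the input is the noiseless matrix $\Omega$ with entries $\theta(i)\theta(j)A(\ell_i, \ell_j)$ rather than a sampled adjacency matrix.

```python
import numpy as np

def score_ratios(X, K):
    """Return the n x (K-1) matrix with entries eta_{k+1}(i) / eta_1(i)."""
    vals, vecs = np.linalg.eigh(X)             # X is assumed symmetric
    lead = np.argsort(-np.abs(vals))[:K]       # K eigenvalues largest in magnitude
    eta = vecs[:, lead]
    return eta[:, 1:] / eta[:, [0]]            # assumes eta_1(i) != 0 for every i

def simple_kmeans(R, K, n_iter=50):
    """Lloyd iterations with a farthest-point start (a choice made for this sketch)."""
    centers = [R[0]]
    for _ in range(K - 1):
        d = np.min([((R - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(R[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((R[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = R[labels == k].mean(axis=0)
    return labels

# Noiseless illustration with strongly varying theta.
n, K = 200, 2
truth = np.arange(n) % K                       # hypothetical community labels
theta = 0.02 + 0.48 * np.arange(1, n + 1) / n
A = np.array([[1.0, 0.5], [0.5, 1.0]])
Omega = np.outer(theta, theta) * A[np.ix_(truth, truth)]

R = score_ratios(Omega, K)
labels = simple_kmeans(R, K)
```

In this noiseless setting the ratios are constant within each community even though $\theta$ varies by a factor of about 25, so the clustering step is trivial. With a sampled adjacency matrix the ratios are noisy, and a node with no edges has $\eta_1(i) = 0$, which this sketch does not handle.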

Compared to many existing methods for DCBM (e.g., [18, 25, 33]), a
very different feature of SCORE is that it does not attempt to estimate the
heterogeneity parameters or to correct the heterogeneity. This is especially
important when many nodes of the network are sparse, in which case the
estimates of the heterogeneity parameters are inaccurate and the estimation
errors can substantially affect subsequent studies. Additionally, when we
attempt to correct the heterogeneity effects, we also tend to inflate the
noise level, resulting in a smaller signal-to-noise ratio in spectral
analysis.

The theoretical conditions required for the success of the SCORE are very
different from those in Zhao et al. [33]. Zhao et al. [33, Page 6] model
the heterogeneity parameters as random variables that assume only finitely
many values and have the same means, which is relatively restrictive. We
model the heterogeneity parameters as a non-stochastic vector that may vary
with the size of the network, and we only need some conditions on
regularity and moderate deviations for consistency. Additionally, Zhao et
al. [33] impose certain conditions on the $K \times K$ core matrix $A$
which we do not require.

The work can be extended in various directions. First, SCORE can be
extended to a large class of methods that utilize a scaling-invariant mapping
that operates on $R$ row by row. Second, the DCBM can be generalized to
more realistic models, where the spectral methods could continue to work
well. For example, in work in progress [17], we have extended the method
to bipartite networks and have seen nice results on the 110th Senate and
House voting network. Third, the ideas developed here can be used to tackle
some other problems in network analysis (e.g., linkage prediction [12]).

In this paper, we have assumed that the number of communities $K$ is known.
In many applications (e.g., the web blogs data and the karate club data), we
have a good idea of how many perceivable communities there are, and such
an assumption makes sense. In some other applications (e.g., coexpression
genetic networks [21, 20]), the situation is more complicated and we may not
have a good idea of how large $K$ is. Community detection for the case where
$K$ is unknown is an unsolved problem, even for low-dimensional clustering
problems. A possible approach is to try our methods for different $K$, and
see for which $K$ the results give the best fit to the data. The study along
this line is non-trivial and we leave it to future work.

Intellectually, this work is connected to the recent interest in low-rank
matrix recovery and matrix completion; see for example [6]. In the area of
low-rank matrix recovery, there is a tendency to use the so-called methods
of nuclear-norm penalization to replace spectral clustering. Our finding says
the contrary: spectral clustering can be effective, and what it takes to make
it effective is some careful adjustment. Such findings are echoed in our
forthcoming manuscript [16], where we show that spectral clustering can be
very effective in cancer clustering with microarray data provided that we
add a careful feature selection step. In spirit, this is connected to several
recent papers by Boots and Gordon; see for example [4].

7. Proofs. In this section, we prove all the lemmas in the preceding 
sections. 

7.1. Proof of Lemma 2.1. Fix $1 \le k \le K$. Let $\lambda_k$ be the nonzero
eigenvalue of $\Omega$ with the $k$-th largest magnitude, let $\eta_k$ be one
of the (unit-norm) eigenvectors associated with $\lambda_k$, and let $a_k$ be
the $K \times 1$ vector such that
$a_k(i) = (\theta^{(i)}/\|\theta^{(i)}\|, \eta_k)$. In our model, we can
rewrite $\Omega$ as

(7.37)  $\Omega = \|\theta\|^2 \sum_{i,j=1}^K (DAD)(i,j) \Bigl(\frac{\theta^{(i)}}{\|\theta^{(i)}\|}\Bigr)\Bigl(\frac{\theta^{(j)}}{\|\theta^{(j)}\|}\Bigr)'$,

and so by basic algebra and notations,

(7.38)  $\Omega \eta_k = \|\theta\|^2 \sum_{i,j=1}^K (DAD)(i,j)\, a_k(j)\, \frac{\theta^{(i)}}{\|\theta^{(i)}\|}$.

At the same time, since $\Omega \eta_k = \lambda_k \eta_k$,

(7.39)  $a_k(i) = (\theta^{(i)}/\|\theta^{(i)}\|, \eta_k) = \frac{1}{\lambda_k}(\theta^{(i)}/\|\theta^{(i)}\|, \Omega\eta_k)$.

Note that $\{\theta^{(i)}/\|\theta^{(i)}\|\}_{i=1}^K$ is an orthonormal base.
Inserting (7.38) into the right hand side of (7.39) gives
$a_k(i) = (\|\theta\|^2/\lambda_k)\sum_{j=1}^K (DAD)(i,j)\, a_k(j)$, or in
matrix form,

(7.40)  $DAD\, a_k = (\lambda_k/\|\theta\|^2)\, a_k$.

This says that $\lambda_k/\|\theta\|^2$ is an eigenvalue of $DAD$ and $a_k$ is
one of the associated eigenvectors. Moreover, inserting (7.40) into the right
hand side of (7.38) and recalling $\Omega\eta_k = \lambda_k\eta_k$,

$\eta_k = \frac{1}{\lambda_k}\Omega\eta_k = \sum_{i=1}^K a_k(i)\,\theta^{(i)}/\|\theta^{(i)}\|$,

and so $\|a_k\|^2 = \|\eta_k\|^2 = 1$. By our assumptions, all eigenvalues of
$DAD$ are simple. It follows that $a_k$ is uniquely determined up to a factor
of $\pm 1$. □
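The two conclusions of this proof, that the nonzero eigenvalues of $\Omega$ are $\|\theta\|^2$ times those of $DAD$ and that $\eta_k = \sum_i a_k(i)\theta^{(i)}/\|\theta^{(i)}\|$, are easy to check numerically. The following sketch does so for one hypothetical choice of $\theta$, $A$ and community labels (all of them assumptions made for illustration).

```python
import numpy as np

K, n = 3, 90
labels = np.arange(n) % K                              # hypothetical communities
theta = 0.1 + 0.8 * (np.arange(1, n + 1) / n) ** 2     # hypothetical theta
A = np.array([[1.0, 0.4, 0.2],
              [0.4, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
Omega = np.outer(theta, theta) * A[np.ix_(labels, labels)]

# Column k of Theta_k is theta^(k): theta restricted to community k, zero elsewhere.
Theta_k = np.stack([np.where(labels == k, theta, 0.0) for k in range(K)], axis=1)
norms_k = np.linalg.norm(Theta_k, axis=0)
U = Theta_k / norms_k                                  # orthonormal columns
D = np.diag(norms_k / np.linalg.norm(theta))
DAD = D @ A @ D

mu, a = np.linalg.eigh(DAD)                            # eigenpairs of the K x K matrix
lam = np.linalg.eigvalsh(Omega)
lam_top = np.sort(lam[np.argsort(-np.abs(lam))[:K]])   # K nonzero eigenvalues, ascending

scaled = np.linalg.norm(theta) ** 2 * mu               # ||theta||^2 * eig(DAD)
eta_from_a = U @ a                                     # sum_i a_k(i) theta^(i)/||theta^(i)||

print(np.allclose(lam_top, scaled))
print(np.allclose(Omega @ eta_from_a, eta_from_a * scaled))
```

Checking the eigen-equation $\Omega\eta_k = \lambda_k\eta_k$ directly, rather than comparing eigenvectors returned by two separate decompositions, avoids having to resolve the $\pm 1$ ambiguity mentioned at the end of the proof.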

7.2. Proof of Lemma 2.2. By the definition of DCBM and (2.8), we have
that $\|\mathrm{diag}(\Omega)\| \le \theta_{\max}^2$, where
$\theta_{\max} \le g < 1$. Note that by (2.11)-(2.12),
$\theta_{\max}\|\theta\|_1 \to \infty$. It follows that

$\|\mathrm{diag}(\Omega)\| = o\bigl(\sqrt{\log(n)\theta_{\max}\|\theta\|_1}\bigr)$.

Therefore, to show the claim, it is sufficient to show that with probability
at least $1 + o(n^{-3})$,

$\|W\| \le 3\sqrt{\log(n)\theta_{\max}\|\theta\|_1}$.

Let $e_i$ be the $n \times 1$ vector such that $e_i(j) = 1$ if $i = j$ and $0$
otherwise. Write $W = \sum_{1 \le i < j \le n} Z^{(i,j)}$, where
$Z^{(i,j)} = W(i,j)[e_i e_j' + e_j e_i']$. Let
$\sigma^2 = \|\sum_{1 \le i < j \le n} E[(Z^{(i,j)})^2]\|$. By elementary
statistics and (2.8), $E[W^2(i,j)] \le \theta(i)\theta(j)$. At the same time,

$E[(Z^{(i,j)})^2] = E[W(i,j)^2] \cdot [e_i e_j' + e_j e_i']^2 = E[W(i,j)^2] \cdot [e_i e_i' + e_j e_j']$.

Combining these gives that $\sigma^2 \le \theta_{\max}\|\theta\|_1$. Fix
$q > 0$. Applying Theorem 4.1 with $Z^{(i,j)} = W(i,j)[e_i e_j' + e_j e_i']$,
$h_0 = 1$, $\sigma^2 = \|\sum_{i<j} E[(Z^{(i,j)})^2]\|$, and
$t = \sqrt{2q\log(n)\theta_{\max}\|\theta\|_1}$,

$P\bigl(\|W\| \ge \sqrt{2q\log(n)\theta_{\max}\|\theta\|_1}\bigr) \le 2n \exp\Bigl(-\frac{q\log(n)}{1 + (1/3)\sqrt{2q\log(n)/(\theta_{\max}\|\theta\|_1)}}\Bigr)$.

Note that $\theta_{\max}\|\theta\|_1 \ge \|\theta\|^2$, and that
$\|\theta\|^2/\log(n) \to \infty$ as in the assumption (2.11). It follows that
$q\log(n)/(\theta_{\max}\|\theta\|_1) \to 0$, and the claim follows by
taking $q = 9/2$.

7.3. Proof of Lemma 2.4. Let $\lambda_1 > \lambda_2 > \cdots > \lambda_K$ be
the $K$ nonzero eigenvalues of $\Omega$. By Lemma 2.3 and (2.14)-(2.15), for
all $1 \le k \le K - 1$, $|\lambda_{k+1} - \lambda_k| \ge C\|\theta\|^2$. At
the same time, by Lemma 2.2 and (2.11), it follows from basic algebra (e.g.,
[2, Page 473]) that with probability at least $1 + o(n^{-3})$,

(7.41)  $|\hat{\lambda}_k - \lambda_k|/\|\theta\|^2 = o(1)$,

and so all the $K$ eigenvalues are simple.

Now, fixing $1 \le k \le K$, let $\hat{\eta}_k$ be an eigenvector (the norm of
which is not necessarily 1) associated with $\hat{\lambda}_k$. Writing for
short
$\tilde{\theta}^{(i,k)} = [I_n - (W - \mathrm{diag}(\Omega))/\hat{\lambda}_k]^{-1}\theta^{(i)}$,
we let $b_k$ be the $K \times 1$ vector such that

$b_k(i) = (\theta^{(i)}/\|\theta^{(i)}\|, \hat{\eta}_k)$, $1 \le i \le K$,

and let

$\hat{a}_k = (B^{(k)})^{-1} b_k$.

Since $X\hat{\eta}_k = \hat{\lambda}_k\hat{\eta}_k$ and
$X = \Omega + (W - \mathrm{diag}(\Omega))$, it follows that

$\hat{\eta}_k = [\hat{\lambda}_k I_n - (W - \mathrm{diag}(\Omega))]^{-1}\Omega\hat{\eta}_k$.

Recall that (e.g., (7.37))
$\Omega = \|\theta\|^2\sum_{i,j=1}^K (DAD)(i,j)(\theta^{(i)}/\|\theta^{(i)}\|)(\theta^{(j)}/\|\theta^{(j)}\|)'$.
Combining these and rearranging,

(7.42)  $\hat{\eta}_k = \Bigl(\frac{\|\theta\|^2}{\hat{\lambda}_k}\Bigr)\sum_{i=1}^K \Bigl(\sum_{j=1}^K (DAD)(i,j)\, b_k(j)\Bigr)\frac{\tilde{\theta}^{(i,k)}}{\|\theta^{(i)}\|}$.

Recall that
$B^{(k)}(\ell,i) = (\theta^{(\ell)})'[I_n - (W - \mathrm{diag}(\Omega))/\hat{\lambda}_k]^{-1}\theta^{(i)}/[\|\theta^{(\ell)}\| \cdot \|\theta^{(i)}\|] = (\theta^{(\ell)})'\tilde{\theta}^{(i,k)}/[\|\theta^{(\ell)}\| \cdot \|\theta^{(i)}\|]$.
Taking the inner product of two sides in (7.42) with
$\theta^{(\ell)}/\|\theta^{(\ell)}\|$, it follows from the definitions of
$b_k$ that for any $1 \le \ell \le K$,

$b_k(\ell) = \Bigl(\frac{\|\theta\|^2}{\hat{\lambda}_k}\Bigr)\sum_{i,j=1}^K B^{(k)}(\ell,i)(DAD)(i,j)\, b_k(j) = \Bigl(\frac{\|\theta\|^2}{\hat{\lambda}_k}\Bigr)\sum_{j=1}^K (B^{(k)}DAD)(\ell,j)\, b_k(j)$,

or in matrix form,

$b_k = \Bigl(\frac{\|\theta\|^2}{\hat{\lambda}_k}\Bigr) B^{(k)} DAD\, b_k$.

This means that $\hat{\lambda}_k/\|\theta\|^2$ is an eigenvalue of
$B^{(k)}DAD$ and $b_k$ is one of the associated eigenvectors. Recall that
$\hat{a}_k = (B^{(k)})^{-1}b_k$. By basic algebra,
$\hat{\lambda}_k/\|\theta\|^2$ is an eigenvalue of $DADB^{(k)}$, and
$\hat{a}_k$ is one of the associated eigenvectors. Especially,

(7.43)  $DAD\, b_k = DADB^{(k)}\hat{a}_k = [\hat{\lambda}_k/\|\theta\|^2]\,\hat{a}_k$.

Inserting (7.43) into (7.42) and rearranging,

$\hat{\eta}_k = \sum_{i=1}^K \hat{a}_k(i) \cdot \tilde{\theta}^{(i,k)}/\|\theta^{(i)}\|$.

We now check the uniqueness of $\hat{a}_k$. By Lemma 7.1 to be introduced
below and (2.12),
$\|B^{(k)} - I_K\|_F \le C\log(n)\|\theta\|_1\|\theta\|_3^3/\|\theta\|^6 = o(1)$.
By similar argument as in (7.41), $\mathrm{eigsp}(DAD) \ge C$. Combining these
with basic algebra (e.g., [2, Page 473]), all eigenvalues of $DADB^{(k)}$ are
simple, and $b_k$ (and so $\hat{\lambda}_k$, $\hat{a}_k$, and
$\hat{\eta}_k$) are uniquely determined up to some scaling factors. If we
further require $\|\hat{a}_k\| = 1$, then $\hat{a}_k$, $b_k$, and
$\hat{\eta}_k$ are all uniquely determined, up to a common factor that takes
values from $\{-1, 1\}$. This gives the claim. □

7.4. Proof of Lemma 2.6. Recall that $D$ is a diagonal matrix where the
$k$-th diagonal is the $k$-th coordinate of the $K \times 1$ vector
$d^{(n)}$, equalling $\|\theta^{(k)}\|/\|\theta\|$, $1 \le k \le K$; the
superscript "(n)" emphasizes the dependence on $n$ (same below). By Lemma 2.1,
$\Theta^{-1}\eta_1 = \sum_{k=1}^K [a_1^{(n)}(k)/\|\theta^{(k)}\|]\, 1_k$, where
$a_1^{(n)}$ is the eigenvector associated with the largest eigenvalue of
$DAD$. By (2.14), to show the lemma, it suffices to show that for sufficiently
large $n$,

(7.44)  $OSC(a_1^{(n)}) \le C$.

Note that in the special case where $d^{(n)}$ does not depend on $n$, the
claim follows directly by Perron's theorem [15, Page 508], since $DAD$ is
non-negative and irreducible. Consider the general case where $d^{(n)}$ may
depend on $n$. If (7.44) does not hold, then we can find a subsequence of
$n \in \{1, 2, \ldots\}$ such that along this sequence, there are two
$K \times 1$ vectors $d_0$ and $a$ such that (a) $OSC(a_1^{(n)}) \to \infty$,
(b) $d^{(n)} \to d_0$, and (c) $a_1^{(n)} \to a$. By the condition (2.15),
$OSC(d_0) \le C$, and a direct use of Perron's theorem [15, Page 508] implies
that $OSC(a) \le C$. This contradicts (a). The contradiction proves (7.44)
and the claim follows. □

7.5. Proof of Lemmas 2.7-2.8. Write
$\tilde{\theta}^{(i,k)} = [I_n - (W - \mathrm{diag}(\Omega))/\hat{\lambda}_k]^{-1}\theta^{(i)}$
for short. In our notations,

(7.45)  $\eta_k = \sum_{i=1}^K [a_k(i)/\|\theta^{(i)}\|]\,\theta^{(i)}$,  $\hat{\eta}_k = \sum_{i=1}^K [\hat{a}_k(i)/\|\theta^{(i)}\|]\,\tilde{\theta}^{(i,k)}$,

where $a_k$ are the eigenvectors of $DAD$ and $\hat{a}_k$ are the eigenvectors
of $DAD(B^{(k)})$. To show the claim, we first characterize
$\|\hat{a}_k - a_k\|$, and then characterize
$\|\tilde{\theta}^{(i,k)} - \theta^{(i)}\|$.

Consider $\|\hat{a}_k - a_k\|$ first. Let $I_K$ be the $K \times K$ identity
matrix. The following lemma is proved below (implicitly, we assume that in
Lemma 7.1, the conditions of Lemmas 2.7-2.8 hold; same for Lemmas 7.3-7.4).

Lemma 7.1. With probability at least $1 + o(n^{-3})$,

$\|B^{(k)} - I_K\|_F \le C\log(n)(\|\theta\|_1 \cdot \|\theta\|_3^3)/\|\theta\|^6$.

Note that by (2.11), the right hand side tends to $0$ as $n \to \infty$.

We also need a lemma on eigenvector sensitivity. Suppose $U$ and $Err$ are
both symmetric $K \times K$ matrices where
$\|Err\| < (1/2)\mathrm{eigsp}(U)$, so that all the eigenvalues of $U$ and
$U + Err$ are simple. Let
$\lambda_1^{(1)} > \lambda_2^{(1)} > \cdots > \lambda_K^{(1)}$ and
$\lambda_1^{(2)} > \lambda_2^{(2)} > \cdots > \lambda_K^{(2)}$ be the
eigenvalues of $U$ and $U + Err$, respectively, and let
$\xi_1^{(1)}, \xi_2^{(1)}, \ldots, \xi_K^{(1)}$ and
$\xi_1^{(2)}, \xi_2^{(2)}, \ldots, \xi_K^{(2)}$ be the corresponding
(unit-norm) eigenvectors of $U$ and $U + Err$, respectively. The following
lemma is proved below.

Lemma 7.2. If $\|Err\| < \mathrm{eigsp}(U)/2$, then for any $1 \le k \le K$,

$\|\xi_k^{(1)} - \xi_k^{(2)}\| \le 2\sqrt{2}\,\frac{\|Err\|}{\mathrm{eigsp}(U)}$.

Note that $\|DAD\| \le C$. Using Lemma 7.1 and basic algebra, with
probability at least $1 + o(n^{-3})$,

$\|DAD(B^{(k)}) - DAD\| \le C\|(B^{(k)}) - I_K\| \le C\log(n)(\|\theta\|_1 \cdot \|\theta\|_3^3)/\|\theta\|^6$.

Applying Lemma 7.2 with $U = DAD$ and $Err = DAD[(B^{(k)}) - I_K]$, it
follows from the eigen-space condition (2.14) that with probability at least
$1 + o(n^{-3})$, for $1 \le k \le K$,

$\|\hat{a}_k - a_k\| \le C\|Err\| \le C\log(n)(\|\theta\|_1 \cdot \|\theta\|_3^3)/\|\theta\|^6$.

By (2.11), the right hand side tends to 0, so

(7.46)  $\|\hat{a}_k - a_k\|^2 \le \|\hat{a}_k - a_k\| \le C\log(n)(\|\theta\|_1 \cdot \|\theta\|_3^3)/\|\theta\|^6$.

Next, we consider $\|\tilde{\theta}^{(i,k)} - \theta^{(i)}\|$. The following
lemmas are proved below.

Lemma 7.3. With probability at least $1 + o(n^{-3})$, for all
$1 \le k, i \le K$,
$\|\tilde{\theta}^{(i,k)} - \theta^{(i)}\|^2 \le C\log(n)\|\theta\|_1\|\theta\|_3^3/\|\theta\|^4$.



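Lemma 7.4. With probability at least $1 + o(n^{-3})$, for any
$1 \le k, i \le K$,

$\|\Theta^{-1}[\tilde{\theta}^{(i,k)} - \theta^{(i)}]\|^2 \le C\log(n)\frac{\|\theta\|_3^3}{\|\theta\|^4}\Bigl[\sum_{j=1}^n \frac{1}{\theta(j)} + \frac{\log(n)\|\theta\|_1^2}{\|\theta\|^4\theta_{\min}}\Bigr]$.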

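We now show Lemmas 2.7-2.8. Consider Lemma 2.7 first. By (7.45) and
basic algebra,

$\|\hat{\eta}_k - \eta_k\|^2 \le C(I + II)$,

where
$I = \sum_{i=1}^K [\hat{a}_k(i)^2/\|\theta^{(i)}\|^2] \cdot \|\tilde{\theta}^{(i,k)} - \theta^{(i)}\|^2$
and $II = \|\hat{a}_k - a_k\|^2$. Since $\hat{a}_k$ has unit norm and
$\|\theta^{(i)}\|^2 \ge C\|\theta\|^2$, combining these with (7.46) and
Lemma 7.3 gives

$\|\hat{\eta}_k - \eta_k\|^2 \le C\log(n)[\|\theta\|_1 \cdot \|\theta\|_3^3]/\|\theta\|^6$,

and the claim follows.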

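Next, consider Lemma 2.8. Similarly,
$\|\Theta^{-1}[\hat{\eta}_k - \eta_k]\|^2 \le C(I + II)$, where

$I = \sum_{i=1}^K \hat{a}_k(i)^2\,\|\Theta^{-1}[\tilde{\theta}^{(i,k)} - \theta^{(i)}]\|^2/\|\theta^{(i)}\|^2$,  $II = \sum_{i=1}^K (\hat{a}_k(i) - a_k(i))^2\,\|\Theta^{-1}\theta^{(i)}\|^2/\|\theta^{(i)}\|^2$.

By Lemma 7.4 and similar argument,

$\|\Theta^{-1}[\hat{\eta}_k - \eta_k]\|^2 \le C\log(n)\frac{\|\theta\|_3^3}{\|\theta\|^6}\Bigl[\sum_{i=1}^n \frac{1}{\theta(i)} + \frac{\log(n)\|\theta\|_1^2}{\|\theta\|^4\theta_{\min}} + \frac{n\|\theta\|_1}{\|\theta\|^2}\Bigr]$.

Note that $n\|\theta\|_1 \le \|\theta\|^2\sum_{i=1}^n \frac{1}{\theta(i)}$;
it follows that

$\|\Theta^{-1}[\hat{\eta}_k - \eta_k]\|^2 \le C\log(n)\frac{\|\theta\|_3^3}{\|\theta\|^6}\Bigl[\sum_{i=1}^n \frac{1}{\theta(i)} + \frac{\log(n)\|\theta\|_1^2}{\|\theta\|^4\theta_{\min}}\Bigr]$,

and the claim follows by the definition of $err_n$; see (2.19). □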

7.6. Proof of Lemma 4.1. Similar to that in the proof of Lemma 2.2, let
$e_i$ be the $n \times 1$ vector such that $e_i(j) = 1$ if $i = j$ and $0$
otherwise. Write $\Theta^{-1}W = \sum_{i<j} Z^{(i,j)}$, where
$Z^{(i,j)} = W(i,j)\Theta^{-1}[e_i e_j' + e_j e_i']$. Let

$\sigma^2 = \max\Bigl\{\Bigl\|\sum_{1 \le i < j \le n} E[Z^{(i,j)}(Z^{(i,j)})']\Bigr\|, \Bigl\|\sum_{1 \le i < j \le n} E[(Z^{(i,j)})'Z^{(i,j)}]\Bigr\|\Bigr\}$.

First, by (2.8) and basic statistics, $E[W^2(i,j)] \le \theta(i)\theta(j)$.
It is seen

$E[Z^{(i,j)}(Z^{(i,j)})'] = E[W^2(i,j)]\Theta^{-1}[e_i e_j' + e_j e_i']^2\Theta^{-1} = E[W^2(i,j)]\Theta^{-1}[e_i e_i' + e_j e_j']\Theta^{-1}$,

which is a diagonal matrix, where the $i$-th diagonal
$\le \theta(j)/\theta(i)$, the $j$-th diagonal $\le \theta(i)/\theta(j)$, and
all other diagonals are 0. Therefore,
$\sum_{1 \le i < j \le n} E[Z^{(i,j)}(Z^{(i,j)})']$ is a diagonal matrix where
the $i$-th coordinate does not exceed $\|\theta\|_1/\theta(i)$, and the matrix
norm of which $\le \|\theta\|_1/\theta_{\min}$.

Second, we similarly have

$E[(Z^{(i,j)})'Z^{(i,j)}] = E[W^2(i,j)][e_i e_j' + e_j e_i']\Theta^{-2}[e_i e_j' + e_j e_i']$,

which is a diagonal matrix where the $i$-th coordinate does not exceed
$\theta(i)/\theta(j)$, the $j$-th coordinate $\le \theta(j)/\theta(i)$. As a
result, $\sum_{1 \le i < j \le n} E[(Z^{(i,j)})'Z^{(i,j)}]$ is a diagonal
matrix where the $i$-th coordinate does not exceed
$\theta(i)\sum_{j=1}^n (1/\theta(j))$, and the matrix norm
$\le \theta_{\max}\sum_{i=1}^n (1/\theta(i))$. Combining these gives

$\sigma^2 \le \max\Bigl\{\frac{1}{\theta_{\min}}\|\theta\|_1,\ \theta_{\max}\sum_{i=1}^n \frac{1}{\theta(i)}\Bigr\} \equiv \sigma_0^2$.

Fix $q > 0$. Applying Theorem 4.1 with $h_0 = 1/\theta_{\min}$ and
$t = \sigma_0\sqrt{2q\log(n)}$ gives

$P\bigl(\|\Theta^{-1}W\| \ge \sigma_0\sqrt{2q\log(n)}\bigr) \le 2n\exp\Bigl(-\frac{q\log(n)}{1 + (1/3)\sqrt{2q\log(n)}\,\theta_{\min}^{-1}/\sigma_0}\Bigr)$.

By (2.18), $\log(n)\theta_{\min}^{-2} \le \sigma_0^2$, and the claim follows
by picking $q$ to be a sufficiently large constant. □

7.7. Proof of Lemma 4.2. Let $Y_1, Y_2, \ldots, Y_n$ be independent random
variables with $|Y_k| \le b$, $E[Y_k] = 0$, and
$\mathrm{var}(Y_k) \le \sigma_k^2$ for $1 \le k \le n$. Write for short
$\sigma^2 = \sigma_1^2 + \cdots + \sigma_n^2$. We claim that with probability
at least $1 + o(1/n^3)$,

(7.47)  $\Bigl|\sum_{i=1}^n Y_i\Bigr|^2 \le C\log(n)\max\{\sigma^2, \log(n)b^2\}$.

In detail, using Bennett's Lemma [26, Page 851], for all $\lambda > 0$,

$P\Bigl(\Bigl|\sum_{i=1}^n Y_i\Bigr| \ge \lambda\Bigr) \le 2\exp\bigl(-\tfrac{c_0}{2\sigma^2}\lambda^2\bigr)$ if $\lambda b \le \sigma^2$,  and  $\le 2\exp\bigl(-\tfrac{c_0}{2b}\lambda\bigr)$ if $\lambda b \ge \sigma^2$,

where $c_0 = \psi(1)$, with $\psi$ as in [26, Page 851]; note that
$c_0 \approx 0.773$. Now, when $\sigma/b \ge 2\sqrt{2\log(n)}$, we take
$\lambda = 2\sqrt{2\log(n)}\,\sigma$. It is seen $\lambda b \le \sigma^2$, and
so

$P\Bigl(\Bigl|\sum_{i=1}^n Y_i\Bigr| \ge \lambda\Bigr) \le 2\exp(-4c_0\log(n)) = o(n^{-3})$.

When $\sigma/b < 2\sqrt{2\log(n)}$, we take $\lambda = 8b\log(n)$. It is seen
$\lambda b > \sigma^2$. It follows that

$P\Bigl(\Bigl|\sum_{i=1}^n Y_i\Bigr| \ge \lambda\Bigr) \le 2\exp(-4c_0\log(n)) = o(n^{-3})$.

Combining these, except for a probability of $o(n^{-3})$,

$\Bigl|\sum_{i=1}^n Y_i\Bigr| \le 2\sqrt{2\log(n)}\,\sigma\, 1\{\sigma/b \ge 2\sqrt{2\log(n)}\} + 8b\log(n)\, 1\{\sigma/b < 2\sqrt{2\log(n)}\}$,

and (7.47) follows.
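The arithmetic behind the two regimes can be checked in a few lines. The sketch below assumes the standard form of Bennett's function, $\psi(x) = (2/x^2)[(1+x)\log(1+x) - x]$, which gives $\psi(1) = 2(2\log 2 - 1) \approx 0.773$ and so matches the constant quoted above; the helper names `psi` and `tail_bound` and the numerical values of $\sigma$ and $b$ are illustrative choices made here.

```python
import math

def psi(x):
    """Bennett's function psi(x) = (2 / x^2) * ((1 + x) * log(1 + x) - x) (assumed form)."""
    return 2.0 * ((1.0 + x) * math.log1p(x) - x) / (x * x)

c0 = psi(1.0)   # equals 2 * (2 * log(2) - 1)

def tail_bound(lam, sigma2, b):
    """Two-regime bound on P(|sum Y_i| >= lam) as used in the proof."""
    if lam * b <= sigma2:
        return 2.0 * math.exp(-c0 * lam ** 2 / (2.0 * sigma2))
    return 2.0 * math.exp(-c0 * lam / (2.0 * b))

n = 1000
log_n = math.log(n)

# Regime 1: sigma / b >= 2 sqrt(2 log n), with lambda = 2 sqrt(2 log n) * sigma.
sigma, b = 100.0, 1.0
p1 = tail_bound(2.0 * math.sqrt(2.0 * log_n) * sigma, sigma ** 2, b)

# Regime 2: sigma / b < 2 sqrt(2 log n), with lambda = 8 b log n.
sigma, b = 1.0, 1.0
p2 = tail_bound(8.0 * b * log_n, sigma ** 2, b)

# Both choices of lambda give the same bound 2 * n^(-4 c0).
print(c0, p1, p2, 2.0 * n ** (-4.0 * c0))
```

Since $4c_0 > 3$, the common bound $2n^{-4c_0}$ is $o(n^{-3})$, which is the rate claimed for (7.47).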

We now show Lemma 4.2. The last item follows directly from (7.47), and
the proofs for the first two items are similar, so we only show the second
item. Let $e_i$ be the $n \times 1$ vector such that $e_i(j) = 1$ if $i = j$
and $0$ otherwise. Write
$\|W\theta^{(\ell)}\|^2 = \sum_{i=1}^n (e_i'W\theta^{(\ell)})^2$. For each
fixed $i$, applying (7.47) to $Y_j = W(i,j)\theta^{(\ell)}(j)$,
$b = \theta_{\max}$, and
$\sigma^2 = E[(\sum_{j \in V^{(\ell)}} \theta(j)W(i,j))^2]$, we have that with
probability at least $1 + o(1/n^3)$,

$|e_i'W\theta^{(\ell)}|^2 \le C\log(n)\max\{\sigma^2, \log(n)\theta_{\max}^2\}$.

Now, direct calculation shows that $\sigma^2 \le \theta(i)\|\theta\|_3^3$. It
follows that with probability at least $1 + o(n^{-2})$,

$\|W\theta^{(\ell)}\|^2 \le C\log(n)\sum_{i=1}^n \max\{\theta(i)\|\theta\|_3^3, \log(n)\theta_{\max}^2\}$,

and the claim follows by the first MDV assumption in (2.17). □

7.8. Proof of Lemma 4.3. We expand $R$ to be an $n \times K$ matrix by adding
a column of ones to the left. For notational simplicity, we still call the
matrix $R$. It is sufficient to show that $R$ has exactly $K$ distinct rows,
and the $\ell^2$-distance for each pair of such distinct rows is no smaller
than $\sqrt{2}$.

With the new notations, since $\Theta$ is a diagonal matrix, for any
$1 \le i \le n$ and $1 \le k \le K$,

$R(i,k) = \frac{\eta_k(i)}{\eta_1(i)} = \frac{(\Theta^{-1}\eta_k)(i)}{(\Theta^{-1}\eta_1)(i)}$,

where $\eta_1, \eta_2, \ldots, \eta_K$ are the $K$ leading eigenvectors of
$\Omega$. Combining this with Lemma 2.1 and recalling that
$d_j = \|\theta^{(j)}\|/\|\theta\|$,

$R(i,k) = \frac{\sum_{j=1}^K a_k(j)\, 1_j(i)/d_j}{\sum_{j=1}^K a_1(j)\, 1_j(i)/d_j}$,

which equals $a_k(\ell)/a_1(\ell)$ if and only if node $i$ belongs to the
$\ell$-th community $V^{(\ell)}$, $\ell = 1, 2, \ldots, K$. It is now evident
that $R$ has $K$ distinct rows, each of which is one of the following
row-vectors:

$\frac{1}{a_1(\ell)}(a_1(\ell), a_2(\ell), \ldots, a_K(\ell))$,  $\ell = 1, 2, \ldots, K$.

Fix $k \ne \ell$. The square of the $\ell^2$-distance between the vector
$\frac{1}{a_1(k)}(a_1(k), \ldots, a_K(k))$ and the vector
$\frac{1}{a_1(\ell)}(a_1(\ell), \ldots, a_K(\ell))$ is

$\frac{1}{a_1^2(k)}\sum_{j=1}^K a_j^2(k) + \frac{1}{a_1^2(\ell)}\sum_{j=1}^K a_j^2(\ell) - \frac{2}{a_1(k)a_1(\ell)}\sum_{j=1}^K a_j(k)a_j(\ell)$.

Since $a_1, a_2, \ldots, a_K$ form an orthonormal base,
$\sum_{j=1}^K a_j^2(k) = 1$, $\sum_{j=1}^K a_j^2(\ell) = 1$, and
$\sum_{j=1}^K a_j(k)a_j(\ell) = 0$. Therefore, the square of the
$\ell^2$-distance between these two vectors is
$a_1^{-2}(k) + a_1^{-2}(\ell)$ and the claim follows since $|a_1(k)| \le 1$
and $|a_1(\ell)| \le 1$. □
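The distance identity at the end of this proof can be verified numerically for a small example. In the sketch below, the matrix $A$ and the vector $d$ are hypothetical values chosen only so that $DAD$ has positive entries (so the leading eigenvector has no zero coordinate); they are not taken from the paper.

```python
import numpy as np
from itertools import combinations

K = 3
A = np.array([[1.0, 0.4, 0.2],
              [0.4, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
d = np.array([0.5, 0.6, 0.62])           # hypothetical d_k = ||theta^(k)|| / ||theta||
d = d / np.linalg.norm(d)                # by definition the d_k^2 sum to 1
DAD = np.diag(d) @ A @ np.diag(d)

mu, vecs = np.linalg.eigh(DAD)
a = vecs[:, np.argsort(-mu)]             # a[:, 0] is the leading eigenvector a_1
rows = a / a[:, [0]]                     # row l is (a_1(l), ..., a_K(l)) / a_1(l)

pairs = list(combinations(range(K), 2))
dist2 = {p: float(np.sum((rows[p[0]] - rows[p[1]]) ** 2)) for p in pairs}
pred2 = {p: float(a[p[0], 0] ** -2 + a[p[1], 0] ** -2) for p in pairs}

for p in pairs:
    print(p, dist2[p], pred2[p])
```

Each squared distance agrees with $a_1^{-2}(k) + a_1^{-2}(\ell)$, and each is at least 2 because the coordinates of a unit vector are at most 1 in absolute value.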

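7.9. Proof of Lemma 7.1. Write for short $U = \mathrm{diag}(\Omega)$ and
$H = (W - \mathrm{diag}(\Omega))/\hat{\lambda}_k$. For $1 \le i, j \le K$,
$B^{(k)}(i,j) = (\theta^{(i)})'[I_n - H]^{-1}\theta^{(j)}/(\|\theta^{(i)}\| \cdot \|\theta^{(j)}\|)$.
By (2.15), for all $1 \le i \le K$, $\|\theta^{(i)}\| \asymp \|\theta\|$. All
we need to show is

(7.48)  $|(\theta^{(i)})'[(I_n - H)^{-1} - I_n]\theta^{(j)}| \le C\log(n)\|\theta\|_1\|\theta\|_3^3/\|\theta\|^4$,  $1 \le i, j \le K$;

note that $(\theta^{(i)})'\theta^{(j)} = \|\theta^{(i)}\|^2$ if $i = j$ and
$0$ otherwise.

Write

(7.49)  $(\theta^{(i)})'[(I_n - H)^{-1} - I_n]\theta^{(j)} = I + II$,

where

$I = (\theta^{(i)})'H\theta^{(j)}$,  $II = (\theta^{(i)})'H[I_n - H]^{-1}H\theta^{(j)}$.

Consider $I$ first. By $H = (W - U)/\hat{\lambda}_k$, we have

(7.50)  $I = \frac{1}{\hat{\lambda}_k}(Ia - Ib)$,

where $Ia = (\theta^{(i)})'W\theta^{(j)}$, $Ib = (\theta^{(i)})'U\theta^{(j)}$.
First, by (2.8) and that all $\theta(i) \le 1$,
$|Ib| \le \|\theta\|_4^4 \le \|\theta\|_3^3$. Second, by Lemma 4.2, with
probability at least $1 + o(n^{-3})$,
$|Ia| \le C\sqrt{\log(n)}\max\{\|\theta\|_3^3, \sqrt{\log(n)}\theta_{\max}^2\}$.
Last, by (2.16), with probability at least $1 + o(n^{-3})$,
$|\hat{\lambda}_k| \asymp \|\theta\|^2$. Inserting these into (7.50) gives

(7.51)  $|I| \le C\sqrt{\log(n)}\max\{\|\theta\|_3^3, \sqrt{\log(n)}\theta_{\max}^2\}/\|\theta\|^2$.

Consider $II$ next. First, by Schwartz inequality,
$|II| \le \|(I_n - H)^{-1/2}H\theta^{(i)}\| \cdot \|(I_n - H)^{-1/2}H\theta^{(j)}\|$.
Second, by (2.11) and Lemma 2.3, with probability at least $1 + o(n^{-3})$,
$\|(I_n - H)^{-1/2}\| \lesssim 1$ and $\hat{\lambda}_k \asymp \|\theta\|^2$.
Therefore, for any $1 \le i \le K$, $\|(I_n - H)^{-1/2}H\theta^{(i)}\|^2$
does not exceed

$\frac{1}{\hat{\lambda}_k^2}\|(I_n - H)^{-1/2}(W - U)\theta^{(i)}\|^2 \le C\|(W - U)\theta^{(i)}\|^2/\|\theta\|^4 \le C(IIa + IIb)/\|\theta\|^4$,

where $IIa = \|W\theta^{(i)}\|^2$ and $IIb = \|U\theta^{(i)}\|^2$. Now, on one
hand, by Lemma 4.2, with probability at least $1 + o(n^{-3})$,

$\|W\theta^{(i)}\|^2 \le C\log(n)\|\theta\|_1\|\theta\|_3^3$.

On the other hand, by basic algebra and that $\|\theta\|_\infty \le 1$,

$\|U\theta^{(i)}\|^2 \le \|\theta\|_6^6 \le \|\theta\|_3^3$.

Note that by (2.11)-(2.12), $\log(n)\|\theta\|_1 \ge 1$. Combining these gives
that with probability at least $1 + o(1/n^2)$,

(7.52)  $|II| \le C\log(n)\|\theta\|_1\|\theta\|_3^3/\|\theta\|^4$.

Inserting (7.51) and (7.52) into (7.49) gives that with probability at least
$1 + o(1/n^2)$,

(7.53)  $|(\theta^{(i)})'[(I_n - H)^{-1} - I_n]\theta^{(j)}| \le \frac{C\log(n)}{\|\theta\|^4}\bigl[\max\{\|\theta\|_3^3, \sqrt{\log(n)}\theta_{\max}^2\}\|\theta\|^2 + \|\theta\|_1\|\theta\|_3^3\bigr]$.

Write
$\max\{\|\theta\|_3^3, \sqrt{\log(n)}\theta_{\max}^2\}\|\theta\|^2 \le \|\theta\|_3^3\|\theta\|^2 + \sqrt{\log(n)}\theta_{\max}^2\|\theta\|^2$.
First, since $\|\theta\|_\infty \le 1$,
$\|\theta\|^2\|\theta\|_3^3 \le \|\theta\|_1\|\theta\|_3^3$. Second, by
(2.11),
$\sqrt{\log(n)}\theta_{\max}^2\|\theta\|^2 \le \|\theta\|^4 \le \|\theta\|_1\|\theta\|_3^3$.
Inserting these into (7.53) gives (7.48) and the claim follows. □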

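7.10. Proof of Lemma 7.2. By the assumptions and elementary algebra,
it is seen that $(\lambda_k^{(1)}, \xi_k^{(1)})$ and
$(\lambda_k^{(2)}, \xi_k^{(2)})$ take real values. By definitions,
$(U + Err)\xi_k^{(2)} = \lambda_k^{(2)}\xi_k^{(2)}$, and so

$(U - \lambda_k^{(2)}I_K)\xi_k^{(2)} = -Err\,\xi_k^{(2)}$.

Since $\{\xi_1^{(1)}, \xi_2^{(1)}, \ldots, \xi_K^{(1)}\}$ constitute an
orthonormal base, we have

$\|(U - \lambda_k^{(2)}I_K)\xi_k^{(2)}\|^2 = \sum_{i=1}^K (\lambda_i^{(1)} - \lambda_k^{(2)})^2(\xi_k^{(2)}, \xi_i^{(1)})^2 \ge \sum_{i \ne k} (\lambda_k^{(2)} - \lambda_i^{(1)})^2(\xi_k^{(2)}, \xi_i^{(1)})^2$.

Combining these gives

(7.54)  $\sum_{i \ne k} (\lambda_k^{(2)} - \lambda_i^{(1)})^2(\xi_k^{(2)}, \xi_i^{(1)})^2 \le (\xi_k^{(2)})'(Err)^2\xi_k^{(2)} \le \|Err\|^2$.

By the assumption of $\|Err\| < (1/2)\mathrm{eigsp}(U)$, for all
$1 \le i \le K$ and $i \ne k$,
$|\lambda_k^{(2)} - \lambda_i^{(1)}| \ge (1/2)\mathrm{eigsp}(U)$. Inserting
this into (7.54) gives

$\sum_{i \ne k} (\xi_k^{(2)}, \xi_i^{(1)})^2 \le 4\|Err\|^2/\mathrm{eigsp}(U)^2$.

Since
$(\xi_k^{(2)}, \xi_k^{(1)})^2 = 1 - \sum_{i \ne k}(\xi_k^{(2)}, \xi_i^{(1)})^2$,
it follows that
$(\xi_k^{(2)}, \xi_k^{(1)})^2 \ge 1 - 4\|Err\|^2/\mathrm{eigsp}(U)^2$, and the
claim follows by basic algebra. □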
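A quick numerical check of the eigenvector-sensitivity bound, read here as $\|\xi_k^{(1)} - \xi_k^{(2)}\| \le 2\sqrt{2}\,\|Err\|/\mathrm{eigsp}(U)$, is sketched below. Two assumptions are made for the illustration: $\mathrm{eigsp}(U)$ is taken to be the smallest gap between consecutive eigenvalues of $U$, and the sign of each perturbed eigenvector is chosen so that its inner product with the unperturbed one is non-negative. The matrices themselves are arbitrary seeded examples.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4

# Symmetric U with eigenvalues 1, 2, 3, 4, so the minimum eigen-gap is 1.
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))
U = Q @ np.diag([1.0, 2.0, 3.0, 4.0]) @ Q.T
U = (U + U.T) / 2.0
eigsp = 1.0

# Symmetric perturbation rescaled to spectral norm 0.2 < eigsp / 2.
E = rng.standard_normal((K, K))
E = (E + E.T) / 2.0
Err = 0.2 * E / np.linalg.norm(E, 2)

_, xi1 = np.linalg.eigh(U)            # columns: eigenvectors of U
_, xi2 = np.linalg.eigh(U + Err)      # columns: eigenvectors of U + Err

signs = np.sign(np.sum(xi1 * xi2, axis=0))         # resolve the +/-1 ambiguity
gaps = np.linalg.norm(xi1 - xi2 * signs, axis=0)   # ||xi_k^(1) - xi_k^(2)|| for each k
bound = 2.0 * np.sqrt(2.0) * np.linalg.norm(Err, 2) / eigsp

print(gaps, bound)
```

Because the perturbation is smaller than half the eigen-gap, the ordering of the eigenvalues is preserved, so matching the $k$-th eigenvector of $U$ with the $k$-th eigenvector of $U + Err$ is legitimate.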

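7.11. Proof of Lemma 7.3. Write for short $U = \mathrm{diag}(\Omega)$ and
$H = (W - \mathrm{diag}(\Omega))/\hat{\lambda}_k$. Similarly, write

$\tilde{\theta}^{(i,k)} - \theta^{(i)} = [(I_n - H)^{-1} - I_n]\theta^{(i)} = (I_n - H)^{-1}H\theta^{(i)}$.

By Lemma 2.3, with probability at least $1 + o(n^{-3})$, $\|H\| = o(1)$, and
so

(7.55)  $\|\tilde{\theta}^{(i,k)} - \theta^{(i)}\| \le \|(I_n - H)^{-1}\| \cdot \|H\theta^{(i)}\| \lesssim \|H\theta^{(i)}\|$.

Next, by (2.16), with probability at least $1 + o(n^{-3})$,
$\hat{\lambda}_k \asymp \|\theta\|^2$. It follows from basic algebra that

(7.56)  $\|H\theta^{(i)}\|^2 \le \frac{2}{\hat{\lambda}_k^2}(I + II) \le C(I + II)/\|\theta\|^4$,

where $I = \|W\theta^{(i)}\|^2$ and $II = \|U\theta^{(i)}\|^2$. Note that,
first, since $\|\theta\|_\infty \le 1$,

(7.57)  $II \le \|\theta\|_6^6 \le \|\theta\|_3^3$.

Second, by the assumption (2.17) and Lemma 4.2, with probability at least
$1 + o(n^{-3})$,

(7.58)  $I \le C\log(n)\|\theta\|_1\|\theta\|_3^3$.

Combining (7.57)-(7.58) with (7.55)-(7.56) gives

(7.59)  $\|\tilde{\theta}^{(i,k)} - \theta^{(i)}\|^2 \le C\log(n)(\|\theta\|_1 + 1)\|\theta\|_3^3/\|\theta\|^4$.

By (2.11) and basic algebra, $\|\theta\|_1 \ge \|\theta\|^2 \ge \log(n)$, and
so $\|\theta\|_1 + 1 \le 2\|\theta\|_1$.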
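Inserting this into (7.59) gives the claim. □

7.12. Proof of Lemma 7.4. Write for short $U = \mathrm{diag}(\Omega)$ and
$H = (W - \mathrm{diag}(\Omega))/\hat{\lambda}_k$. Since
$(I_n - H)^{-1} - I_n = H + H(I_n - H)^{-1}H$, it follows from definitions
and basic algebra that

(7.60)  $\|\Theta^{-1}[\tilde{\theta}^{(i,k)} - \theta^{(i)}]\|^2 = \|\Theta^{-1}[(I_n - H)^{-1} - I_n]\theta^{(i)}\|^2 \le 2(I + II)$,

where

$I = \|\Theta^{-1}H\theta^{(i)}\|^2$,  $II = \|\Theta^{-1}H(I_n - H)^{-1}H\theta^{(i)}\|^2$.

Consider $I$ first. Similarly, by Lemma 2.3, with probability at least
$1 + o(n^{-3})$, $\hat{\lambda}_k \asymp \|\theta\|^2$, and so

(7.61)  $I = \frac{1}{\hat{\lambda}_k^2}\|\Theta^{-1}W\theta^{(i)} - \Theta^{-1}U\theta^{(i)}\|^2 \le \frac{C}{\|\theta\|^4}[Ia + Ib]$,

where

$Ia = \|\Theta^{-1}W\theta^{(i)}\|^2$,  $Ib = \|\Theta^{-1}U\theta^{(i)}\|^2$.

Now, first, since $\|\theta\|_\infty \le 1$,

(7.62)  $Ib \le \|\theta\|_4^4 \le \|\theta\|_3^3$.

Second, by (2.17) and Lemma 4.2, with probability at least $1 + o(n^{-3})$,

(7.63)  $Ia \le C\log(n)\|\theta\|_3^3\sum_{j=1}^n \frac{1}{\theta(j)}$.

Inserting (7.62)-(7.63) into (7.61) gives that with probability at least
$1 + o(n^{-3})$,

(7.64)  $I \le C\log(n)\frac{\|\theta\|_3^3}{\|\theta\|^4}\sum_{j=1}^n \frac{1}{\theta(j)}$.

Next, we analyze $II$. By definitions,
$II = \frac{1}{\hat{\lambda}_k^4}\|\Theta^{-1}(W - U)(I_n - H)^{-1}(W - U)\theta^{(i)}\|^2$.
Recall that $\hat{\lambda}_k \asymp \|\theta\|^2$ with probability at least
$1 + o(n^{-3})$, and so by basic algebra,

(7.65)  $II \le \frac{C}{\|\theta\|^8}\|\Theta^{-1}(W - U)(I_n - H)^{-1}(W - U)\theta^{(i)}\|^2 \le \frac{C}{\|\theta\|^8}\, IIa \cdot IIb$,

where $IIa = \|\Theta^{-1}(W - U)(I_n - H)^{-1}\|^2$ and
$IIb = \|(W - U)\theta^{(i)}\|^2$.

Consider $IIa$ first. By Lemma 2.2, with probability $1 + o(n^{-3})$,
$\|H\| = o(1)$. Therefore,

$IIa \le C\|\Theta^{-1}(W - U)\|^2 \le C[\|\Theta^{-1}W\|^2 + \|\Theta^{-1}U\|^2]$.

Next, by (2.18) and Lemma 4.1, we have with probability at least
$1 + o(n^{-3})$,

$\|\Theta^{-1}W\|^2 \le C\log(n)\max\Bigl\{\theta_{\max}\sum_{j=1}^n \frac{1}{\theta(j)},\ \frac{\|\theta\|_1}{\theta_{\min}}\Bigr\}$.

At the same time, it is seen $\|\Theta^{-1}U\|^2 \le 1$, which is much smaller
than the right hand side of the equation above. Combining these gives

(7.66)  $IIa \le C\log(n)\max\Bigl\{\theta_{\max}\sum_{j=1}^n \frac{1}{\theta(j)},\ \frac{\|\theta\|_1}{\theta_{\min}}\Bigr\}$.

Next, we consider $IIb$. Write

$IIb = \|(W - U)\theta^{(i)}\|^2 \le C[\|W\theta^{(i)}\|^2 + \|U\theta^{(i)}\|^2]$.

On one hand, by Lemma 4.2, with probability at least $1 + o(n^{-3})$,

$\|W\theta^{(i)}\|^2 \le C\log(n)\|\theta\|_1 \cdot \|\theta\|_3^3$.

On the other hand, since $\|\theta\|_\infty \le 1$, by definitions and direct
calculations,

$\|U\theta^{(i)}\|^2 \le \|\theta\|_6^6 \le \|\theta\|_3^3$.

Recall that (2.11) implies $\|\theta\|_1 \ge \|\theta\|^2 \ge 1$. Combining
these gives

(7.67)  $IIb \le C\log(n)[1 + \|\theta\|_1]\|\theta\|_3^3 \le C\log(n)\|\theta\|_1 \cdot \|\theta\|_3^3$.

Inserting (7.66)-(7.67) into (7.65) gives

(7.68)  $II \le C\log^2(n)\frac{\|\theta\|_1\|\theta\|_3^3}{\|\theta\|^8}\max\Bigl\{\theta_{\max}\sum_{j=1}^n \frac{1}{\theta(j)},\ \frac{\|\theta\|_1}{\theta_{\min}}\Bigr\}$.

Inserting (7.64) and (7.68) into (7.60),
$\|\Theta^{-1}[\tilde{\theta}^{(i,k)} - \theta^{(i)}]\|^2$ does not exceed

$C\log(n)\frac{\|\theta\|_3^3}{\|\theta\|^4}\Bigl[\sum_{j=1}^n \frac{1}{\theta(j)} + \frac{\log(n)\|\theta\|_1}{\|\theta\|^4}\max\Bigl\{\theta_{\max}\sum_{j=1}^n \frac{1}{\theta(j)},\ \frac{\|\theta\|_1}{\theta_{\min}}\Bigr\}\Bigr]$.

By (2.11), $\log(n)\theta_{\max}\|\theta\|_1/\|\theta\|^4 \to 0$, and so

$\|\Theta^{-1}[\tilde{\theta}^{(i,k)} - \theta^{(i)}]\|^2 \le C\log(n)\frac{\|\theta\|_3^3}{\|\theta\|^4}\Bigl[\sum_{j=1}^n \frac{1}{\theta(j)} + \frac{\log(n)\|\theta\|_1^2}{\|\theta\|^4\theta_{\min}}\Bigr]$,

and the claim follows. □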